View Issue Details

IDProjectCategoryView StatusLast Update
0001565OpenMPTUser Interfacepublic2022-03-15 09:20
ReporterProgrammerNerd Assigned Tomanx  
PrioritynormalSeverityminorReproducibilityalways
Status resolvedResolutionfixed 
PlatformArch LinuxOSLinux 
Product VersionOpenMPT 1.30.02.00 / libopenmpt 0.6.1 (upgrade first) 
Target VersionOpenMPT 1.30.03.00 / libopenmpt 0.6.2 (upgrade first)Fixed in VersionOpenMPT 1.30.03.00 / libopenmpt 0.6.2 (upgrade first) 
Summary0001565: Jester's epilepsy.mod breaks the openmpt123 user interface
Description

When playing epilepsy.mod everything after the title gets messed up.

Here is an example of what it looks like:

openmpt123 v0.6.1, libopenmpt 0.6.1+r16764.pkg (OpenMPT 1.30.02.00 https://source.openmpt.org/svn/openmpt/tags/libopenmpt-0.6.1@16764 (2022-01-30T16:49:19.812343Z) clean-pkg)
         L : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>:            :            ://lib.openmpt.org/>
         R : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>:            :            
Settings...: Gain: 0 dB   Stereo: 0 %   Filter: 8 taps   Ramping: -1   
Mixer......: CPU::0.06%   Chn:::1   
Player.....: Ord:::0/:14 Pat:::2 Row:::3   Spd::5 Tmp:125   
Position...: 00:00.350 / 01:29.599   mpatible
Title......:  199

For comparison here is what I get when playing any other module:

openmpt123 v0.6.1, libopenmpt 0.6.1+r16764.pkg (OpenMPT 1.30.02.00 https://source.openmpt.org/svn/openmpt/tags/libopenmpt-0.6.1@16764 (2022-01-30T16:49:19.812343Z) clean-pkg)
Copyright (c) 2013-2022 OpenMPT Project Developers and Contributors <https://lib.openmpt.org/>

Filename...: cyberride.mod
Size.......: 199kB
Type.......: mod (ProTracker MOD (M.K.))
Tracker....: Generic ProTracker or compatible
Title......: cyberride
Duration...: 04:48.820
Subsongs...: 1
Channels...: 4
Orders.....: 61
Patterns...: 41
Instruments: 0
Samples....: 31

         L : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    :          :            
         R : >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>    :          :            
Settings...: Gain: 0 dB   Stereo: 0 %   Filter: 8 taps   Ramping: -1   
Mixer......: CPU::0.12%   Chn:::5   
Player.....: Ord:::0/:61 Pat:::0 Row:::5   Spd::5 Tmp:125   
Position...: 00:00.549 / 04:48.820   
Steps To Reproduce
  1. Download: https://modarchive.org/index.php?request=view_by_moduleid&query=106265

  2. Play it using openmpt123 using: openmpt123 ./epilepsy.mod

  3. Notice how the user interface looks different than when playing other modules.

TagsNo tags attached.
Has the bug occurred in previous versions?Yes
Tested code revision (in case you know it)

Relationships

related to 0001567 assignedmanx openmpt123 and other libopenmpt clients should filter control characters 
related to 0000569 assignedmanx Unicode strings in CSoundFile. 

Activities

ProgrammerNerd

ProgrammerNerd

2022-02-07 05:44

reporter   ~0005043

The link to modarchive.org is not correct. The ampersand got replaced with "&". Replace "&" with & or search for epilepsy.mod. There is only one result when doing so.

ProgrammerNerd

ProgrammerNerd

2022-02-07 05:45

reporter   ~0005044

When I typed & amp ; (without the spaces) it replaced it with an ampersand. I can't edit my comment so I'm posted this. Sorry for all these notes. This is my first time using the issue tracker.

Saga Musix

Saga Musix

2022-02-07 09:15

administrator   ~0005045

I would assume this to be caused by byte 9E being present in the song title, which is transformed into unicode code point U+0093, which is part of the C1 controls:

In Unicode, the characters U+0080 to U+009F are control characters and should not be used for display. Under Windows, however, browsers can use these characters from the Windows 1252 ANSI character set

On Windows, openmpt123 displays the character as [?]. I guess our transcoding tables should avoid the C1 block and transcode this particular character as C5 BE instead.

ProgrammerNerd

ProgrammerNerd

2022-02-08 01:15

reporter   ~0005048

Yes, that makes sense because it displays the 199 but it doesn't display the Z with an accent after it like OpenMPT does.

I am wondering what would happen if this the module were loaded into Protracker on an Amiga? Would it also display a 'Z' (with an accent) or does 0x9E mean something else on an Amiga?

ProgrammerNerd

ProgrammerNerd

2022-02-08 03:47

reporter   ~0005049

Here I opened the module in Protracker 2.3D. Attached is a screenshot. It looks like the Z doesn't show on my setup. I will note that I'm not familiar with what options the Amiga had for international users and if Jester being from Germany would have installed some language pack or something similar in concept and if that would have an impact on Protracker.

protracker-screenshot.png (5,046 bytes)   
protracker-screenshot.png (5,046 bytes)   
manx

manx

2022-02-08 07:35

administrator   ~0005050

ISO8859-1 byte 0x9e corresponds to Unicode code point U+009e, not U+0093. There is no modification of value here because ISO8859-1 bytes map 1:1 to Unicode code points.

OpenMPT may display žbecause OpenMPT does not handle character encoding properly for the song title (yet, see 0000569). ž is what 9e maps to in Windows-1252 encoding (which is the default for Windows installs in Western Languages).

I do not think that libopenmpt should filter anything here, as there may be legitimate reasons for module formats supporting Unicode text to include Unicode Control Characters. libopenmpt is documented as returning "UTF8" and not "UTF8 without control characters" or anything similar.

openmpt123 should filter any such Unicode Control Characters when displaying them. However, that is not as easy as it sounds. Unicode splatters all kinds of harmless and problematic control characters all over its code point space. If we just filter C1 and some C0, we would still miss a lot. A proper solution would make depend openmpt123 on a full Unicode library like libicu. I am actually not opposed to make openmpt123 depend on libicu, however that is not something that could be implemented in a timely manner.

I guess we could just filter C1 and most of C0 for now, as there are only few formats supporting UTF8, this would likely be a fair tradeoff (for now).

Saga Musix

Saga Musix

2022-02-08 08:13

administrator   ~0005051

Sorry, of course I mean U+009E, not U+00093. From my understanding, the codepoints in C1 have alternative encodings which do not double as control characters such as PRIVACY MESSAGE (PM) in this case, in which case it would be possible to find an encoding that doesn't cause the terminal to do strange things.

I am wondering what would happen if this the module were loaded into Protracker on an Amiga? Would it also display a 'Z' (with an accent) or does 0x9E mean something else on an Amiga?

Amiga in general uses ISO-8859-1 but ProTracker bitmap fonts may not support characters beyond ASCII, or not all of them. In general, it's impossible to know which tracker a MOD file was made in, so we cannot rely on which characters ProTracker does and does not support, because a MOD file might just as well have been made in another tracker.

manx

manx

2022-02-08 09:08

administrator   ~0005052

As far as Unicode and official mapping tables are concerned, I am not aware of any alternative mappings from ISO-8859-1 to Unicode.

However, looking at some Amiga Bitmap Fonts that are used in the ANSI scene (I looked at the Ansilove/PHP package), some fonts have duplicates of the range 0xc0..0xdf in the range 0x80..0x9f that are color-inverted. I guess it might make sense to assume the same mapping for Amiga trackers, however in this particular and likely most common case (ProTracker), this mapping would also be just wrong according to the screenshot (it would display Þ(Latin Capital Letter Thorn)).

Other Amiga Bitmap Fonts have this range either blank or all empty squares.

Saga Musix

Saga Musix

2022-02-08 10:51

administrator   ~0005053

Right, so as the only difference between standard Latin-1 and Windows-1252 are the characters in range 0x80..0x9F being printable, I wonder if we should assume Windows-1252 to be the encoding used in MOD files (as that would allow e.g. umlauts from OpenMPT to keep working), and for other formats using Latin-1, we stop mapping 0x80..0x9F to control characters.

Obviously this wouldn't cause the title of this particular module to be displayed "correctly", but it's hard to reason what is "correct" here anyway. To me it looks like this 0x9E byte isn't supposed to be there anyway, and is maybe caused by a corrupted file transfer.

manx

manx

2022-02-08 12:28

administrator   ~0005054

Right, so as the only difference between standard Latin-1 and Windows-1252 are the characters in range 0x80..0x9F being printable, I wonder if we should assume Windows-1252 to be the encoding used in MOD files (as that would allow e.g. umlauts from OpenMPT to keep working), and for other formats using Latin-1, we stop mapping 0x80..0x9F to control characters.

I guess we could do that. According to <https://en.wikipedia.org/wiki/ISO/IEC_8859-1>, the difference would basically come down to using ISO/IEC 8859-15:1999 instead of IANA/MIME ISO-8859-1.
In the implementation, the only thing that would need to be done, is add support for this currently unsupported encoding, and changing the default encodings for file formats.

However, in any case, openmpt123 should probably still filter known control characters.

Saga Musix

Saga Musix

2022-02-08 19:37

administrator   ~0005055

I think ISO 8859-15 isn't quite what we want here, as it replaces existing characters in ISO-8859-1 rather than adding them to the 0x80..0x9F range like Windows-1252. See this table: <https://en.wikipedia.org/wiki/ISO/IEC_8859-15#Codepage_layout>

manx

manx

2022-02-08 19:39

administrator   ~0005056

And that would have been my typo here. I meant to say (or copy) ISO/IEC 8859-1:1998.

manx

manx

2022-02-08 19:42

administrator   ~0005057

ISO/IEC 8859-1:1998 being the one which leaves the range undefined. IANA/MIME ISO-8859-1 defines it to be C1.

manx

manx

2022-02-08 20:04

administrator   ~0005058

There are also <https://www.iana.org/assignments/charset-reg/Amiga-1251> and <https://de.wikipedia.org/wiki/Commodore_Amiga_(Zeichensatz)> (german), but both do not solve our issue here (both use C1). It probably makes sense to implement this as a distinct Amiga encoding anyway instead of using any generic 8859-1 encoding for Amiga formats, due to the soft-hyphen and del differences.

manx

manx

2022-02-10 10:46

administrator   ~0005061

Last edited: 2022-02-10 15:09

So the current plan appears to be:

  • 1) implement Amiga charset as per <https://de.wikipedia.org/wiki/Commodore_Amiga_(Zeichensatz)> (trunk)
  • 2) implement Amiga_NoC1 charset (trunk and 0.6)
  • 3) switch all Amiga file formats to default to Amiga_NoC1 charset (trunk and 0.6)
  • 4) implement _NoC1 variants also for ISO88591-1 and ISO8859-15 for symmetry reasons (trunk)
  • 5) implement generic C1 filtering in openmpt123 (future work)
  • 6) implement generic Unicode Control Characters filtering in openmpt123 (even further future work, requires libicu)

The remaining questions is:

  • What do we map C1 control codes to? Possible options are:
    • A) Windows-1252 (might make sense for compatibility with existing OpenMPT MOD files, makes no sense for any format that OpenMPT cannot export)
    • B) 0xc0..0xdf (as done in some Amiga fonts popular for ANSI graphics)
    • C) SPACE (U+0020)
    • D) REPLACEMENT CHARACTER (U+FFFD)
    • E) QUESTION MARK (U+003F) (replacement character inside the source character set)
    • F) ZERO WIDTH SPACE (U+200B) (somewhat likely to cause further issues down the line)
    • G) nothing

A, C, and E will be equally easy to implement.
B, D, and F require special code for converting to the encoding because our existing conversion algorithm starts at 0x20 and goes upwards for finding the first match.
G requires special case code for both, from encoding, and to encoding.

I'm not sure if any of this warrants back-porting to earlier release branches, especially since I did rewrite all of charset conversions for 0.6, which basically requires implementing it twice for back-porting.

manx

manx

2022-02-10 15:08

administrator   ~0005062

Last edited: 2022-02-10 15:09

status update:

  • 1) has been implemented in r16902 and r16904.
  • 2) and 4) are implemented using option D) by the attached patch.
bug1565-implement-no-c1-charsets-v1.patch (14,800 bytes)   
Index: common/mptString.cpp
===================================================================
--- common/mptString.cpp	(revision 16908)
+++ common/mptString.cpp	(working copy)
@@ -357,18 +357,21 @@
 	switch(charset)
 	{
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
-		case Charset::Locale:      return mpt::encode<Tdststring>(mpt::logical_encoding::locale, src); break;
+		case Charset::Locale:           return mpt::encode<Tdststring>(mpt::logical_encoding::locale, src); break;
 #endif
-		case Charset::UTF8:        return mpt::encode<Tdststring>(mpt::common_encoding::utf8, src); break;
-		case Charset::ASCII:       return mpt::encode<Tdststring>(mpt::common_encoding::ascii, src); break;
-		case Charset::ISO8859_1:   return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1, src); break;
-		case Charset::ISO8859_15:  return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15, src); break;
-		case Charset::CP850:       return mpt::encode<Tdststring>(mpt::common_encoding::cp850, src); break;
-		case Charset::CP437:       return mpt::encode<Tdststring>(mpt::common_encoding::cp437, src); break;
-		case Charset::CP437AMS:    return mpt::encode<Tdststring>(CharsetTableCP437AMS, src); break;
-		case Charset::CP437AMS2:   return mpt::encode<Tdststring>(CharsetTableCP437AMS2, src); break;
-		case Charset::Windows1252: return mpt::encode<Tdststring>(mpt::common_encoding::windows1252, src); break;
-		case Charset::Amiga:       return mpt::encode<Tdststring>(mpt::common_encoding::amiga, src); break;
+		case Charset::UTF8:             return mpt::encode<Tdststring>(mpt::common_encoding::utf8, src); break;
+		case Charset::ASCII:            return mpt::encode<Tdststring>(mpt::common_encoding::ascii, src); break;
+		case Charset::ISO8859_1:        return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1, src); break;
+		case Charset::ISO8859_15:       return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15, src); break;
+		case Charset::CP850:            return mpt::encode<Tdststring>(mpt::common_encoding::cp850, src); break;
+		case Charset::CP437:            return mpt::encode<Tdststring>(mpt::common_encoding::cp437, src); break;
+		case Charset::CP437AMS:         return mpt::encode<Tdststring>(CharsetTableCP437AMS, src); break;
+		case Charset::CP437AMS2:        return mpt::encode<Tdststring>(CharsetTableCP437AMS2, src); break;
+		case Charset::Windows1252:      return mpt::encode<Tdststring>(mpt::common_encoding::windows1252, src); break;
+		case Charset::Amiga:            return mpt::encode<Tdststring>(mpt::common_encoding::amiga, src); break;
+		case Charset::ISO8859_1_no_C1:  return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1_no_c1, src); break;
+		case Charset::ISO8859_15_no_C1: return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15_no_c1, src); break;
+		case Charset::Amiga_no_C1:      return mpt::encode<Tdststring>(mpt::common_encoding::amiga_no_c1, src); break;
 	}
 	return Tdststring();
 }
@@ -383,18 +386,21 @@
 	switch(charset)
 	{
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
-		case Charset::Locale:      return mpt::decode<Tsrcstring>(mpt::logical_encoding::locale, src); break;
+		case Charset::Locale:           return mpt::decode<Tsrcstring>(mpt::logical_encoding::locale, src); break;
 #endif
-		case Charset::UTF8:        return mpt::decode<Tsrcstring>(mpt::common_encoding::utf8, src); break;
-		case Charset::ASCII:       return mpt::decode<Tsrcstring>(mpt::common_encoding::ascii, src); break;
-		case Charset::ISO8859_1:   return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1, src); break;
-		case Charset::ISO8859_15:  return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15, src); break;
-		case Charset::CP850:       return mpt::decode<Tsrcstring>(mpt::common_encoding::cp850, src); break;
-		case Charset::CP437:       return mpt::decode<Tsrcstring>(mpt::common_encoding::cp437, src); break;
-		case Charset::CP437AMS:    return mpt::decode<Tsrcstring>(CharsetTableCP437AMS, src); break;
-		case Charset::CP437AMS2:   return mpt::decode<Tsrcstring>(CharsetTableCP437AMS2, src); break;
-		case Charset::Windows1252: return mpt::decode<Tsrcstring>(mpt::common_encoding::windows1252, src); break;
-		case Charset::Amiga:       return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga, src); break;
+		case Charset::UTF8:             return mpt::decode<Tsrcstring>(mpt::common_encoding::utf8, src); break;
+		case Charset::ASCII:            return mpt::decode<Tsrcstring>(mpt::common_encoding::ascii, src); break;
+		case Charset::ISO8859_1:        return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1, src); break;
+		case Charset::ISO8859_15:       return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15, src); break;
+		case Charset::CP850:            return mpt::decode<Tsrcstring>(mpt::common_encoding::cp850, src); break;
+		case Charset::CP437:            return mpt::decode<Tsrcstring>(mpt::common_encoding::cp437, src); break;
+		case Charset::CP437AMS:         return mpt::decode<Tsrcstring>(CharsetTableCP437AMS, src); break;
+		case Charset::CP437AMS2:        return mpt::decode<Tsrcstring>(CharsetTableCP437AMS2, src); break;
+		case Charset::Windows1252:      return mpt::decode<Tsrcstring>(mpt::common_encoding::windows1252, src); break;
+		case Charset::Amiga:            return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga, src); break;
+		case Charset::ISO8859_1_no_C1:  return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1_no_c1, src); break;
+		case Charset::ISO8859_15_no_C1: return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15_no_c1, src); break;
+		case Charset::Amiga_no_C1:      return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga_no_c1, src); break;
 	}
 	return mpt::widestring();
 }
Index: common/mptString.h
===================================================================
--- common/mptString.h	(revision 16908)
+++ common/mptString.h	(working copy)
@@ -68,6 +68,10 @@
 
 	Amiga,
 
+	ISO8859_1_no_C1,
+	ISO8859_15_no_C1,
+	Amiga_no_C1,
+
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
 	Locale, // CP_ACP on windows, current C locale otherwise
 #endif // MPT_ENABLE_CHARSET_LOCALE
Index: src/mpt/string/types.hpp
===================================================================
--- src/mpt/string/types.hpp	(revision 16908)
+++ src/mpt/string/types.hpp	(working copy)
@@ -37,6 +37,9 @@
 	cp437,
 	windows1252,
 	amiga,
+	iso8859_1_no_c1,
+	iso8859_15_no_c1,
+	amiga_no_c1,
 };
 
 
Index: src/mpt/string_transcode/transcode.hpp
===================================================================
--- src/mpt/string_transcode/transcode.hpp	(revision 16908)
+++ src/mpt/string_transcode/transcode.hpp	(working copy)
@@ -43,27 +43,25 @@
 
 
 
-/*
-default 1:1 mapping
+
+// default 1:1 mapping
 inline constexpr char32_t CharsetTableISO8859_1[256] = {
-	0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000a,0x000b,0x000c,0x000d,0x000e,0x000f,
-	0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001a,0x001b,0x001c,0x001d,0x001e,0x001f,
-	0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002a,0x002b,0x002c,0x002d,0x002e,0x002f,
-	0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003a,0x003b,0x003c,0x003d,0x003e,0x003f,
-	0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004a,0x004b,0x004c,0x004d,0x004e,0x004f,
-	0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005a,0x005b,0x005c,0x005d,0x005e,0x005f,
-	0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006a,0x006b,0x006c,0x006d,0x006e,0x006f,
-	0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007a,0x007b,0x007c,0x007d,0x007e,0x007f,
-	0x0080,0x0081,0x0082,0x0083,0x0084,0x0085,0x0086,0x0087,0x0088,0x0089,0x008a,0x008b,0x008c,0x008d,0x008e,0x008f,
-	0x0090,0x0091,0x0092,0x0093,0x0094,0x0095,0x0096,0x0097,0x0098,0x0099,0x009a,0x009b,0x009c,0x009d,0x009e,0x009f,
-	0x00a0,0x00a1,0x00a2,0x00a3,0x00a4,0x00a5,0x00a6,0x00a7,0x00a8,0x00a9,0x00aa,0x00ab,0x00ac,0x00ad,0x00ae,0x00af,
-	0x00b0,0x00b1,0x00b2,0x00b3,0x00b4,0x00b5,0x00b6,0x00b7,0x00b8,0x00b9,0x00ba,0x00bb,0x00bc,0x00bd,0x00be,0x00bf,
-	0x00c0,0x00c1,0x00c2,0x00c3,0x00c4,0x00c5,0x00c6,0x00c7,0x00c8,0x00c9,0x00ca,0x00cb,0x00cc,0x00cd,0x00ce,0x00cf,
-	0x00d0,0x00d1,0x00d2,0x00d3,0x00d4,0x00d5,0x00d6,0x00d7,0x00d8,0x00d9,0x00da,0x00db,0x00dc,0x00dd,0x00de,0x00df,
-	0x00e0,0x00e1,0x00e2,0x00e3,0x00e4,0x00e5,0x00e6,0x00e7,0x00e8,0x00e9,0x00ea,0x00eb,0x00ec,0x00ed,0x00ee,0x00ef,
-	0x00f0,0x00f1,0x00f2,0x00f3,0x00f4,0x00f5,0x00f6,0x00f7,0x00f8,0x00f9,0x00fa,0x00fb,0x00fc,0x00fd,0x00fe,0x00ff
-};
-*/
+	0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
+	0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f,
+	0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f,
+	0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f,
+	0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f,
+	0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f,
+	0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f,
+	0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x007f,
+	0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087, 0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f,
+	0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097, 0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009e, 0x009f,
+	0x00a0, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, 0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x00ad, 0x00ae, 0x00af,
+	0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, 0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf,
+	0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, 0x00c8, 0x00c9, 0x00ca, 0x00cb, 0x00cc, 0x00cd, 0x00ce, 0x00cf,
+	0x00d0, 0x00d1, 0x00d2, 0x00d3, 0x00d4, 0x00d5, 0x00d6, 0x00d7, 0x00d8, 0x00d9, 0x00da, 0x00db, 0x00dc, 0x00dd, 0x00de, 0x00df,
+	0x00e0, 0x00e1, 0x00e2, 0x00e3, 0x00e4, 0x00e5, 0x00e6, 0x00e7, 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef,
+	0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7, 0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff};
 
 inline constexpr char32_t CharsetTableISO8859_15[256] = {
 	0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
@@ -206,6 +204,60 @@
 }
 
 template <typename Tsrcstring>
+inline mpt::widestring decode_8bit_no_c1(const Tsrcstring & str, const char32_t (&table)[256], mpt::widechar replacement = MPT_WIDECHAR('\uFFFD')) {
+	mpt::widestring res;
+	res.reserve(str.length());
+	for (std::size_t i = 0; i < str.length(); ++i) {
+		std::size_t c = static_cast<std::size_t>(mpt::char_value(str[i]));
+		if ((0x80 <= c) && (c <= 0x9f)) {
+			res.push_back(replacement);
+		} else if (c < std::size(table)) {
+			res.push_back(static_cast<mpt::widechar>(table[c]));
+		} else {
+			res.push_back(replacement);
+		}
+	}
+	return res;
+}
+
+template <typename Tdststring>
+inline Tdststring encode_8bit_no_c1(const mpt::widestring & str, const char32_t (&table)[256], char replacement = '?') {
+	Tdststring res;
+	res.reserve(str.length());
+	for (std::size_t i = 0; i < str.length(); ++i) {
+		char32_t c = static_cast<char32_t>(str[i]);
+		bool found = false;
+		// Try non-control characters first.
+		// In cases where there are actual characters mirrored in this range (like in AMS/AMS2 character sets),
+		// characters in the common range are preferred this way.
+		for (std::size_t x = 0x20; x < std::size(table); ++x) {
+			if ((0x80 <= c) && (c <= 0x9f)) {
+				continue;
+			}
+			if (c == table[x]) {
+				res.push_back(static_cast<typename Tdststring::value_type>(static_cast<uint8>(x)));
+				found = true;
+				break;
+			}
+		}
+		if (!found) {
+			// try control characters
+			for (std::size_t x = 0x00; x < std::size(table) && x < 0x20; ++x) {
+				if (c == table[x]) {
+					res.push_back(static_cast<typename Tdststring::value_type>(static_cast<uint8>(x)));
+					found = true;
+					break;
+				}
+			}
+		}
+		if (!found) {
+			res.push_back(replacement);
+		}
+	}
+	return res;
+}
+
+template <typename Tsrcstring>
 inline mpt::widestring decode_ascii(const Tsrcstring & str, mpt::widechar replacement = MPT_WIDECHAR('\uFFFD')) {
 	mpt::widestring res;
 	res.reserve(str.length());
@@ -493,6 +545,15 @@
 		case common_encoding::amiga:
 			result = false;
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			result = false;
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			result = false;
+			break;
+		case common_encoding::amiga_no_c1:
+			result = false;
+			break;
 	}
 	return result;
 }
@@ -550,6 +611,15 @@
 		case common_encoding::amiga:
 			throw std::domain_error("unsupported encoding");
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
+		case common_encoding::amiga_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
 	}
 	return result;
 }
@@ -856,6 +926,15 @@
 		case common_encoding::amiga:
 			return encode_8bit<Tdststring>(src, CharsetTableAmiga);
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableISO8859_1);
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableISO8859_15);
+			break;
+		case common_encoding::amiga_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableAmiga);
+			break;
 	}
 	throw std::domain_error("unsupported encoding");
 }
@@ -953,6 +1032,15 @@
 		case common_encoding::amiga:
 			return decode_8bit(src, CharsetTableAmiga);
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableISO8859_1);
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableISO8859_15);
+			break;
+		case common_encoding::amiga_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableAmiga);
+			break;
 	}
 	throw std::domain_error("unsupported encoding");
 }
manx

manx

2022-02-10 15:50

administrator   ~0005063

status update:

  • r16909 implements RISC OS charset
  • r16910 uses RISC OS charset for Digital Symphony files

RISC OS charset is also based on ISO8859-1, but has actual glyphs in the C1 range, so there is no question as to which these should map to in order to fix the same issue as this one.

manx

manx

2022-02-10 16:05

administrator   ~0005064

Last edited: 2022-02-10 16:20

status update:

  • updated patch for 2) and 4) using D) attached
  • second patch switches all Amiga formats to Amiga-noC1 charset. This also switches MOD, which is still up for discussion if it should rather use Windows-1252.

TODO:

  • maybe also switch all other ISO8859-1 usages to ISO8859-1-noC1 (I would rather not change it for formats where the specification clearly states ISO8889-1):
    • UMX
    • MP3 (would violate spec)
    • AU
    • WAV (would violate spec)
    • SCL (would violate spec)
bug1565-implement-no-c1-charsets-v2.patch (15,020 bytes)   
Index: common/mptString.cpp
===================================================================
--- common/mptString.cpp	(revision 16910)
+++ common/mptString.cpp	(working copy)
@@ -357,18 +357,21 @@
 	switch(charset)
 	{
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
-		case Charset::Locale:      return mpt::encode<Tdststring>(mpt::logical_encoding::locale, src); break;
+		case Charset::Locale:           return mpt::encode<Tdststring>(mpt::logical_encoding::locale, src); break;
 #endif
-		case Charset::UTF8:        return mpt::encode<Tdststring>(mpt::common_encoding::utf8, src); break;
-		case Charset::ASCII:       return mpt::encode<Tdststring>(mpt::common_encoding::ascii, src); break;
-		case Charset::ISO8859_1:   return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1, src); break;
-		case Charset::ISO8859_15:  return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15, src); break;
-		case Charset::CP850:       return mpt::encode<Tdststring>(mpt::common_encoding::cp850, src); break;
-		case Charset::CP437:       return mpt::encode<Tdststring>(mpt::common_encoding::cp437, src); break;
-		case Charset::CP437AMS:    return mpt::encode<Tdststring>(CharsetTableCP437AMS, src); break;
-		case Charset::CP437AMS2:   return mpt::encode<Tdststring>(CharsetTableCP437AMS2, src); break;
-		case Charset::Windows1252: return mpt::encode<Tdststring>(mpt::common_encoding::windows1252, src); break;
-		case Charset::Amiga:       return mpt::encode<Tdststring>(mpt::common_encoding::amiga, src); break;
+		case Charset::UTF8:             return mpt::encode<Tdststring>(mpt::common_encoding::utf8, src); break;
+		case Charset::ASCII:            return mpt::encode<Tdststring>(mpt::common_encoding::ascii, src); break;
+		case Charset::ISO8859_1:        return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1, src); break;
+		case Charset::ISO8859_15:       return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15, src); break;
+		case Charset::CP850:            return mpt::encode<Tdststring>(mpt::common_encoding::cp850, src); break;
+		case Charset::CP437:            return mpt::encode<Tdststring>(mpt::common_encoding::cp437, src); break;
+		case Charset::CP437AMS:         return mpt::encode<Tdststring>(CharsetTableCP437AMS, src); break;
+		case Charset::CP437AMS2:        return mpt::encode<Tdststring>(CharsetTableCP437AMS2, src); break;
+		case Charset::Windows1252:      return mpt::encode<Tdststring>(mpt::common_encoding::windows1252, src); break;
+		case Charset::Amiga:            return mpt::encode<Tdststring>(mpt::common_encoding::amiga, src); break;
+		case Charset::ISO8859_1_no_C1:  return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_1_no_c1, src); break;
+		case Charset::ISO8859_15_no_C1: return mpt::encode<Tdststring>(mpt::common_encoding::iso8859_15_no_c1, src); break;
+		case Charset::Amiga_no_C1:      return mpt::encode<Tdststring>(mpt::common_encoding::amiga_no_c1, src); break;
 		case Charset::RISC_OS:     return mpt::encode<Tdststring>(mpt::common_encoding::riscos, src); break;
 	}
 	return Tdststring();
@@ -384,18 +387,21 @@
 	switch(charset)
 	{
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
-		case Charset::Locale:      return mpt::decode<Tsrcstring>(mpt::logical_encoding::locale, src); break;
+		case Charset::Locale:           return mpt::decode<Tsrcstring>(mpt::logical_encoding::locale, src); break;
 #endif
-		case Charset::UTF8:        return mpt::decode<Tsrcstring>(mpt::common_encoding::utf8, src); break;
-		case Charset::ASCII:       return mpt::decode<Tsrcstring>(mpt::common_encoding::ascii, src); break;
-		case Charset::ISO8859_1:   return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1, src); break;
-		case Charset::ISO8859_15:  return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15, src); break;
-		case Charset::CP850:       return mpt::decode<Tsrcstring>(mpt::common_encoding::cp850, src); break;
-		case Charset::CP437:       return mpt::decode<Tsrcstring>(mpt::common_encoding::cp437, src); break;
-		case Charset::CP437AMS:    return mpt::decode<Tsrcstring>(CharsetTableCP437AMS, src); break;
-		case Charset::CP437AMS2:   return mpt::decode<Tsrcstring>(CharsetTableCP437AMS2, src); break;
-		case Charset::Windows1252: return mpt::decode<Tsrcstring>(mpt::common_encoding::windows1252, src); break;
-		case Charset::Amiga:       return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga, src); break;
+		case Charset::UTF8:             return mpt::decode<Tsrcstring>(mpt::common_encoding::utf8, src); break;
+		case Charset::ASCII:            return mpt::decode<Tsrcstring>(mpt::common_encoding::ascii, src); break;
+		case Charset::ISO8859_1:        return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1, src); break;
+		case Charset::ISO8859_15:       return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15, src); break;
+		case Charset::CP850:            return mpt::decode<Tsrcstring>(mpt::common_encoding::cp850, src); break;
+		case Charset::CP437:            return mpt::decode<Tsrcstring>(mpt::common_encoding::cp437, src); break;
+		case Charset::CP437AMS:         return mpt::decode<Tsrcstring>(CharsetTableCP437AMS, src); break;
+		case Charset::CP437AMS2:        return mpt::decode<Tsrcstring>(CharsetTableCP437AMS2, src); break;
+		case Charset::Windows1252:      return mpt::decode<Tsrcstring>(mpt::common_encoding::windows1252, src); break;
+		case Charset::Amiga:            return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga, src); break;
+		case Charset::ISO8859_1_no_C1:  return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_1_no_c1, src); break;
+		case Charset::ISO8859_15_no_C1: return mpt::decode<Tsrcstring>(mpt::common_encoding::iso8859_15_no_c1, src); break;
+		case Charset::Amiga_no_C1:      return mpt::decode<Tsrcstring>(mpt::common_encoding::amiga_no_c1, src); break;
 		case Charset::RISC_OS:     return mpt::decode<Tsrcstring>(mpt::common_encoding::riscos, src); break;
 	}
 	return mpt::widestring();
Index: common/mptString.h
===================================================================
--- common/mptString.h	(revision 16910)
+++ common/mptString.h	(working copy)
@@ -69,6 +69,10 @@
 	Amiga,
 	RISC_OS,
 
+	ISO8859_1_no_C1,
+	ISO8859_15_no_C1,
+	Amiga_no_C1,
+
 #if defined(MPT_ENABLE_CHARSET_LOCALE)
 	Locale, // CP_ACP on windows, current C locale otherwise
 #endif // MPT_ENABLE_CHARSET_LOCALE
Index: src/mpt/string/types.hpp
===================================================================
--- src/mpt/string/types.hpp	(revision 16910)
+++ src/mpt/string/types.hpp	(working copy)
@@ -38,6 +38,9 @@
 	windows1252,
 	amiga,
 	riscos,
+	iso8859_1_no_c1,
+	iso8859_15_no_c1,
+	amiga_no_c1,
 };
 
 
Index: src/mpt/string_transcode/transcode.hpp
===================================================================
--- src/mpt/string_transcode/transcode.hpp	(revision 16910)
+++ src/mpt/string_transcode/transcode.hpp	(working copy)
@@ -43,27 +43,25 @@
 
 
 
-/*
-default 1:1 mapping
+
+// default 1:1 mapping
 inline constexpr char32_t CharsetTableISO8859_1[256] = {
-	0x0000,0x0001,0x0002,0x0003,0x0004,0x0005,0x0006,0x0007,0x0008,0x0009,0x000a,0x000b,0x000c,0x000d,0x000e,0x000f,
-	0x0010,0x0011,0x0012,0x0013,0x0014,0x0015,0x0016,0x0017,0x0018,0x0019,0x001a,0x001b,0x001c,0x001d,0x001e,0x001f,
-	0x0020,0x0021,0x0022,0x0023,0x0024,0x0025,0x0026,0x0027,0x0028,0x0029,0x002a,0x002b,0x002c,0x002d,0x002e,0x002f,
-	0x0030,0x0031,0x0032,0x0033,0x0034,0x0035,0x0036,0x0037,0x0038,0x0039,0x003a,0x003b,0x003c,0x003d,0x003e,0x003f,
-	0x0040,0x0041,0x0042,0x0043,0x0044,0x0045,0x0046,0x0047,0x0048,0x0049,0x004a,0x004b,0x004c,0x004d,0x004e,0x004f,
-	0x0050,0x0051,0x0052,0x0053,0x0054,0x0055,0x0056,0x0057,0x0058,0x0059,0x005a,0x005b,0x005c,0x005d,0x005e,0x005f,
-	0x0060,0x0061,0x0062,0x0063,0x0064,0x0065,0x0066,0x0067,0x0068,0x0069,0x006a,0x006b,0x006c,0x006d,0x006e,0x006f,
-	0x0070,0x0071,0x0072,0x0073,0x0074,0x0075,0x0076,0x0077,0x0078,0x0079,0x007a,0x007b,0x007c,0x007d,0x007e,0x007f,
-	0x0080,0x0081,0x0082,0x0083,0x0084,0x0085,0x0086,0x0087,0x0088,0x0089,0x008a,0x008b,0x008c,0x008d,0x008e,0x008f,
-	0x0090,0x0091,0x0092,0x0093,0x0094,0x0095,0x0096,0x0097,0x0098,0x0099,0x009a,0x009b,0x009c,0x009d,0x009e,0x009f,
-	0x00a0,0x00a1,0x00a2,0x00a3,0x00a4,0x00a5,0x00a6,0x00a7,0x00a8,0x00a9,0x00aa,0x00ab,0x00ac,0x00ad,0x00ae,0x00af,
-	0x00b0,0x00b1,0x00b2,0x00b3,0x00b4,0x00b5,0x00b6,0x00b7,0x00b8,0x00b9,0x00ba,0x00bb,0x00bc,0x00bd,0x00be,0x00bf,
-	0x00c0,0x00c1,0x00c2,0x00c3,0x00c4,0x00c5,0x00c6,0x00c7,0x00c8,0x00c9,0x00ca,0x00cb,0x00cc,0x00cd,0x00ce,0x00cf,
-	0x00d0,0x00d1,0x00d2,0x00d3,0x00d4,0x00d5,0x00d6,0x00d7,0x00d8,0x00d9,0x00da,0x00db,0x00dc,0x00dd,0x00de,0x00df,
-	0x00e0,0x00e1,0x00e2,0x00e3,0x00e4,0x00e5,0x00e6,0x00e7,0x00e8,0x00e9,0x00ea,0x00eb,0x00ec,0x00ed,0x00ee,0x00ef,
-	0x00f0,0x00f1,0x00f2,0x00f3,0x00f4,0x00f5,0x00f6,0x00f7,0x00f8,0x00f9,0x00fa,0x00fb,0x00fc,0x00fd,0x00fe,0x00ff
-};
-*/
+	0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
+	0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001a, 0x001b, 0x001c, 0x001d, 0x001e, 0x001f,
+	0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002a, 0x002b, 0x002c, 0x002d, 0x002e, 0x002f,
+	0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003a, 0x003b, 0x003c, 0x003d, 0x003e, 0x003f,
+	0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004a, 0x004b, 0x004c, 0x004d, 0x004e, 0x004f,
+	0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005a, 0x005b, 0x005c, 0x005d, 0x005e, 0x005f,
+	0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006a, 0x006b, 0x006c, 0x006d, 0x006e, 0x006f,
+	0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007a, 0x007b, 0x007c, 0x007d, 0x007e, 0x007f,
+	0x0080, 0x0081, 0x0082, 0x0083, 0x0084, 0x0085, 0x0086, 0x0087, 0x0088, 0x0089, 0x008a, 0x008b, 0x008c, 0x008d, 0x008e, 0x008f,
+	0x0090, 0x0091, 0x0092, 0x0093, 0x0094, 0x0095, 0x0096, 0x0097, 0x0098, 0x0099, 0x009a, 0x009b, 0x009c, 0x009d, 0x009e, 0x009f,
+	0x00a0, 0x00a1, 0x00a2, 0x00a3, 0x00a4, 0x00a5, 0x00a6, 0x00a7, 0x00a8, 0x00a9, 0x00aa, 0x00ab, 0x00ac, 0x00ad, 0x00ae, 0x00af,
+	0x00b0, 0x00b1, 0x00b2, 0x00b3, 0x00b4, 0x00b5, 0x00b6, 0x00b7, 0x00b8, 0x00b9, 0x00ba, 0x00bb, 0x00bc, 0x00bd, 0x00be, 0x00bf,
+	0x00c0, 0x00c1, 0x00c2, 0x00c3, 0x00c4, 0x00c5, 0x00c6, 0x00c7, 0x00c8, 0x00c9, 0x00ca, 0x00cb, 0x00cc, 0x00cd, 0x00ce, 0x00cf,
+	0x00d0, 0x00d1, 0x00d2, 0x00d3, 0x00d4, 0x00d5, 0x00d6, 0x00d7, 0x00d8, 0x00d9, 0x00da, 0x00db, 0x00dc, 0x00dd, 0x00de, 0x00df,
+	0x00e0, 0x00e1, 0x00e2, 0x00e3, 0x00e4, 0x00e5, 0x00e6, 0x00e7, 0x00e8, 0x00e9, 0x00ea, 0x00eb, 0x00ec, 0x00ed, 0x00ee, 0x00ef,
+	0x00f0, 0x00f1, 0x00f2, 0x00f3, 0x00f4, 0x00f5, 0x00f6, 0x00f7, 0x00f8, 0x00f9, 0x00fa, 0x00fb, 0x00fc, 0x00fd, 0x00fe, 0x00ff};
 
 inline constexpr char32_t CharsetTableISO8859_15[256] = {
 	0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x000e, 0x000f,
@@ -226,6 +224,60 @@
 }
 
 template <typename Tsrcstring>
+inline mpt::widestring decode_8bit_no_c1(const Tsrcstring & str, const char32_t (&table)[256], mpt::widechar replacement = MPT_WIDECHAR('\uFFFD')) {
+	mpt::widestring res;
+	res.reserve(str.length());
+	for (std::size_t i = 0; i < str.length(); ++i) {
+		std::size_t c = static_cast<std::size_t>(mpt::char_value(str[i]));
+		if ((0x80 <= c) && (c <= 0x9f)) {
+			res.push_back(replacement);
+		} else if (c < std::size(table)) {
+			res.push_back(static_cast<mpt::widechar>(table[c]));
+		} else {
+			res.push_back(replacement);
+		}
+	}
+	return res;
+}
+
+template <typename Tdststring>
+inline Tdststring encode_8bit_no_c1(const mpt::widestring & str, const char32_t (&table)[256], char replacement = '?') {
+	Tdststring res;
+	res.reserve(str.length());
+	for (std::size_t i = 0; i < str.length(); ++i) {
+		char32_t c = static_cast<char32_t>(str[i]);
+		bool found = false;
+		// Try non-control characters first.
+		// In cases where there are actual characters mirrored in this range (like in AMS/AMS2 character sets),
+		// characters in the common range are preferred this way.
+		for (std::size_t x = 0x20; x < std::size(table); ++x) {
+			if ((0x80 <= c) && (c <= 0x9f)) {
+				continue;
+			}
+			if (c == table[x]) {
+				res.push_back(static_cast<typename Tdststring::value_type>(static_cast<uint8>(x)));
+				found = true;
+				break;
+			}
+		}
+		if (!found) {
+			// try control characters
+			for (std::size_t x = 0x00; x < std::size(table) && x < 0x20; ++x) {
+				if (c == table[x]) {
+					res.push_back(static_cast<typename Tdststring::value_type>(static_cast<uint8>(x)));
+					found = true;
+					break;
+				}
+			}
+		}
+		if (!found) {
+			res.push_back(replacement);
+		}
+	}
+	return res;
+}
+
+template <typename Tsrcstring>
 inline mpt::widestring decode_ascii(const Tsrcstring & str, mpt::widechar replacement = MPT_WIDECHAR('\uFFFD')) {
 	mpt::widestring res;
 	res.reserve(str.length());
@@ -516,6 +568,15 @@
 		case common_encoding::riscos:
 			result = false;
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			result = false;
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			result = false;
+			break;
+		case common_encoding::amiga_no_c1:
+			result = false;
+			break;
 	}
 	return result;
 }
@@ -576,6 +637,15 @@
 		case common_encoding::riscos:
 			throw std::domain_error("unsupported encoding");
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
+		case common_encoding::amiga_no_c1:
+			throw std::domain_error("unsupported encoding");
+			break;
 	}
 	return result;
 }
@@ -885,6 +955,15 @@
 		case common_encoding::riscos:
 			return encode_8bit<Tdststring>(src, CharsetTableRISC_OS);
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableISO8859_1);
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableISO8859_15);
+			break;
+		case common_encoding::amiga_no_c1:
+			return encode_8bit_no_c1<Tdststring>(src, CharsetTableAmiga);
+			break;
 	}
 	throw std::domain_error("unsupported encoding");
 }
@@ -985,6 +1064,15 @@
 		case common_encoding::riscos:
 			return decode_8bit(src, CharsetTableRISC_OS);
 			break;
+		case common_encoding::iso8859_1_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableISO8859_1);
+			break;
+		case common_encoding::iso8859_15_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableISO8859_15);
+			break;
+		case common_encoding::amiga_no_c1:
+			return decode_8bit_no_c1(src, CharsetTableAmiga);
+			break;
 	}
 	throw std::domain_error("unsupported encoding");
 }
bug1565-use-amiga-no-c1-charset-v1.patch (9,917 bytes)   
Index: soundlib/Load_dbm.cpp
===================================================================
--- soundlib/Load_dbm.cpp	(revision 16910)
+++ soundlib/Load_dbm.cpp	(working copy)
@@ -356,7 +356,7 @@
 	m_modFormat.formatName = U_("DigiBooster Pro");
 	m_modFormat.type = U_("dbm");
 	m_modFormat.madeWithTracker = MPT_UFORMAT("DigiBooster Pro {}.{}")(mpt::ufmt::hex(fileHeader.trkVerHi), mpt::ufmt::hex(fileHeader.trkVerLo));
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	// Name chunk
 	FileReader nameChunk = chunks.GetChunk(DBMChunk::idNAME);
@@ -383,7 +383,7 @@
 			if(Order.AddSequence() == SEQUENCEINDEX_INVALID)
 				break;
 		}
-		Order().SetName(mpt::ToUnicode(mpt::Charset::ISO8859_1, name));
+		Order().SetName(mpt::ToUnicode(mpt::Charset::Amiga_no_C1, name));
 		ReadOrderFromFile<uint16be>(Order(), songChunk, numOrders);
 #else
 		const ORDERINDEX startIndex = Order().GetLength();
Index: soundlib/Load_digi.cpp
===================================================================
--- soundlib/Load_digi.cpp	(revision 16910)
+++ soundlib/Load_digi.cpp	(working copy)
@@ -129,7 +129,7 @@
 	m_modFormat.formatName = U_("DigiBooster");
 	m_modFormat.type = U_("digi");
 	m_modFormat.madeWithTracker = MPT_UFORMAT("Digi Booster {}.{}")(fileHeader.versionInt >> 4, fileHeader.versionInt & 0x0F);
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	ReadOrderFromArray(Order(), fileHeader.orders, fileHeader.lastOrdIndex + 1);
 
Index: soundlib/Load_dtm.cpp
===================================================================
--- soundlib/Load_dtm.cpp	(revision 16910)
+++ soundlib/Load_dtm.cpp	(working copy)
@@ -598,7 +598,7 @@
 	m_modFormat.formatName = U_("Digital Tracker");
 	m_modFormat.type = U_("dtm");
 	m_modFormat.madeWithTracker = std::move(tracker);
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	return true;
 }
Index: soundlib/Load_gt2.cpp
===================================================================
--- soundlib/Load_gt2.cpp	(revision 16910)
+++ soundlib/Load_gt2.cpp	(working copy)
@@ -540,7 +540,7 @@
 	m_modFormat.madeWithTracker = U_("Graoumf Tracker");
 	m_modFormat.formatName = MPT_UFORMAT("Graoumf Tracker v{}")(fileHeader.fileVersion);
 	m_modFormat.type = U_("gtk");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	m_songName = mpt::String::ReadBuf(mpt::String::spacePadded, fileHeader.songName);
 	size_t msgLength = sizeof(fileHeader.smallComment);
@@ -1223,7 +1223,7 @@
 	m_modFormat.madeWithTracker = mpt::ToUnicode(mpt::Charset::ASCII, mpt::String::ReadBuf(mpt::String::spacePadded, fileHeader.trackerName));
 	m_modFormat.formatName = (fileHeader.fileVersion <= 5 ? MPT_UFORMAT("Graoumf Tracker v{}") : MPT_UFORMAT("Graoumf Tracker 2 v{}"))(fileHeader.fileVersion);
 	m_modFormat.type = U_("gt2");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	m_songName = mpt::String::ReadBuf(mpt::String::spacePadded, fileHeader.songName);
 
Index: soundlib/Load_med.cpp
===================================================================
--- soundlib/Load_med.cpp	(revision 16910)
+++ soundlib/Load_med.cpp	(working copy)
@@ -1146,7 +1146,7 @@
 					}
 				}
 				if(playSeq.name[0])
-					order.SetName(mpt::ToUnicode(mpt::Charset::ISO8859_1, mpt::String::ReadAutoBuf(playSeq.name)));
+					order.SetName(mpt::ToUnicode(mpt::Charset::Amiga_no_C1, mpt::String::ReadAutoBuf(playSeq.name)));
 
 				// Play commands (jump / stop)
 				if(playSeq.commandTableOffset > 0 && file.Seek(playSeq.commandTableOffset))
@@ -1194,7 +1194,7 @@
 		{
 			file.ReadString<mpt::String::maybeNullTerminated>(m_songName, expData.songNameLength);
 			if(numSongs > 1)
-				order.SetName(mpt::ToUnicode(mpt::Charset::ISO8859_1, m_songName));
+				order.SetName(mpt::ToUnicode(mpt::Charset::Amiga_no_C1, m_songName));
 		}
 		if(expData.annoLength > 1 && file.Seek(expData.annoText))
 		{
@@ -1434,7 +1434,7 @@
 	m_modFormat.formatName = MPT_UFORMAT("OctaMED (MMD{})")(version);
 	m_modFormat.type = MPT_USTRING("med");
 	m_modFormat.madeWithTracker = madeWithTracker;
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	return true;
 }
Index: soundlib/Load_mid.cpp
===================================================================
--- soundlib/Load_mid.cpp	(revision 16910)
+++ soundlib/Load_mid.cpp	(working copy)
@@ -640,7 +640,7 @@
 	m_modFormat.formatName = U_("Standard MIDI File");
 	m_modFormat.type = isRIFF ? UL_("rmi") : UL_("mid");
 	m_modFormat.madeWithTracker = U_("Standard MIDI File");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	SetMixLevels(MixLevels::v1_17RC3);
 	m_nTempoMode = TempoMode::Modern;
Index: soundlib/Load_mo3.cpp
===================================================================
--- soundlib/Load_mo3.cpp	(revision 16910)
+++ soundlib/Load_mo3.cpp	(working copy)
@@ -1932,7 +1932,7 @@
 	if(m_dwLastSavedWithVersion)
 		m_modFormat.charset = mpt::Charset::Windows1252;
 	else if(GetType() == MOD_TYPE_MOD)
-		m_modFormat.charset = mpt::Charset::ISO8859_1;
+		m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 	else
 		m_modFormat.charset = mpt::Charset::CP437;
 
Index: soundlib/Load_mod.cpp
===================================================================
--- soundlib/Load_mod.cpp	(revision 16910)
+++ soundlib/Load_mod.cpp	(working copy)
@@ -1309,7 +1309,7 @@
 	m_modFormat.type = U_("mod");
 	if(modMagicResult.madeWithTracker)
 		m_modFormat.madeWithTracker = modMagicResult.madeWithTracker;
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	return true;
 }
@@ -1811,7 +1811,7 @@
 	m_modFormat.formatName = U_("Soundtracker");
 	m_modFormat.type = U_("stk");
 	m_modFormat.madeWithTracker = madeWithTracker;
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	// Reading samples
 	if(loadFlags & loadSampleData)
@@ -1900,13 +1900,13 @@
 		m_modFormat.formatName = U_("MnemoTroN SoundTracker");
 		m_modFormat.type = U_("st26");
 		m_modFormat.madeWithTracker = U_("SoundTracker 2.6");
-		m_modFormat.charset = mpt::Charset::ISO8859_1;
+		m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 	} else if(IsMagic(magic, "IT10"))
 	{
 		m_modFormat.formatName = U_("Ice Tracker");
 		m_modFormat.type = U_("ice");
 		m_modFormat.madeWithTracker = U_("Ice Tracker 1.0 / 1.1");
-		m_modFormat.charset = mpt::Charset::ISO8859_1;
+		m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 	} else
 	{
 		return false;
@@ -2140,7 +2140,7 @@
 			chunk.Skip(4);
 			if(chunk.ReadMagic("PT") && iffHead.chunksize > 6)
 			{
-				chunk.ReadString<mpt::String::maybeNullTerminated>(version, mpt::Charset::ISO8859_1, iffHead.chunksize - 6);
+				chunk.ReadString<mpt::String::maybeNullTerminated>(version, mpt::Charset::Amiga_no_C1, iffHead.chunksize - 6);
 			}
 			break;
 
@@ -2196,7 +2196,7 @@
 			std::string author;
 			commentChunk.ReadString<mpt::String::maybeNullTerminated>(author, 32);
 			if(author != "UNNAMED AUTHOR")
-				m_songArtist = mpt::ToUnicode(mpt::Charset::ISO8859_1, author);
+				m_songArtist = mpt::ToUnicode(mpt::Charset::Amiga_no_C1, author);
 			if(!commentChunk.NoBytesLeft())
 			{
 				m_songMessage.ReadFixedLineLength(commentChunk, commentChunk.BytesLeft(), 40, 0);
Index: soundlib/Load_okt.cpp
===================================================================
--- soundlib/Load_okt.cpp	(revision 16910)
+++ soundlib/Load_okt.cpp	(working copy)
@@ -340,7 +340,7 @@
 
 	m_modFormat.formatName = U_("Oktalyzer");
 	m_modFormat.type = U_("okt");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	// Go through IFF chunks...
 	while(file.CanRead(sizeof(OktIffChunk)))
Index: soundlib/Load_sfx.cpp
===================================================================
--- soundlib/Load_sfx.cpp	(revision 16910)
+++ soundlib/Load_sfx.cpp	(working copy)
@@ -469,7 +469,7 @@
 
 	m_modFormat.formatName = m_nSamples == 15 ? MPT_UFORMAT("SoundFX 1.{}")(version) : U_("SoundFX 2.0 / MultiMedia Sound");
 	m_modFormat.type = m_nSamples == 15 ? UL_("sfx") : UL_("sfx2");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	return true;
 }
Index: soundlib/Load_stp.cpp
===================================================================
--- soundlib/Load_stp.cpp	(revision 16910)
+++ soundlib/Load_stp.cpp	(working copy)
@@ -258,7 +258,7 @@
 
 	m_modFormat.formatName = MPT_UFORMAT("Soundtracker Pro II v{}")(fileHeader.version);
 	m_modFormat.type = U_("stp");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	m_nChannels = 4;
 	m_nSamples = 0;
Index: soundlib/Load_symmod.cpp
===================================================================
--- soundlib/Load_symmod.cpp	(revision 16910)
+++ soundlib/Load_symmod.cpp	(working copy)
@@ -1231,7 +1231,7 @@
 			if(symInst.IsEmpty() || symInst.IsVirtual())
 				continue;
 			
-			auto filename = mpt::PathString::FromUnicode(mpt::ToUnicode(mpt::Charset::ISO8859_1, symInst.GetName()));
+			auto filename = mpt::PathString::FromUnicode(mpt::ToUnicode(mpt::Charset::Amiga_no_C1, symInst.GetName()));
 			if(file.GetOptionalFileName())
 				filename = file.GetOptionalFileName()->GetPath() + filename.GetFullFileName();
 			
@@ -1939,7 +1939,7 @@
 		m_modFormat.madeWithTracker = U_("Symphonie Pro");
 	else
 		m_modFormat.madeWithTracker = U_("Symphonie Pro 256");
-	m_modFormat.charset = mpt::Charset::ISO8859_1;
+	m_modFormat.charset = mpt::Charset::Amiga_no_C1;
 
 	return true;
 }
Saga Musix

Saga Musix

2022-02-10 18:30

administrator   ~0005065

second patch switches all Amiga formats to Amiga-noC1 charset. This also switches MOD, which is still up for discussion if it should rather use Windows-1252.

I'd say MOD would make sense to use Windows-1252, and Amiga-noC1 for the rest.

UMX

UMX itself shouldn't contain any strings, as all the actual module loading is taken care of by the IT/XM/S3M/etc. loaders. For UAX, Windows-1252 would make sense as it's a Windows format.

manx

manx

2022-02-10 18:47

administrator   ~0005066

I'd say MOD would make sense to use Windows-1252, and Amiga-noC1 for the rest.

In any case, this could also be a later additional commit.

UMX itself shouldn't contain any strings, as all the actual module loading is taken care of by the IT/XM/S3M/etc. loaders.

We parse out the filename inside container formats, but it is currently unused, so it does not really matter.

For UAX, Windows-1252 would make sense as it's a Windows format.

UAX is already using Windows-1252,

Saga Musix

Saga Musix

2022-02-10 18:48

administrator   ~0005067

UAX and UMX are the same format anyway (just different contents); both should definitely be using Windows-1252.

manx

manx

2022-02-10 18:52

administrator   ~0005068

UAX and UMX are the same format anyway (just different contents); both should definitely be using Windows-1252.

Yeah, true. r16911

manx

manx

2022-02-11 09:52

administrator   ~0005071

status update:

  • r16903 uses Amiga charset for LHA
  • r16911 uses Windows-1252 charset for UMX
  • r16912 implements 2) and 4) with option D)
  • r16913 uses Amiga_no_C1 charset for LHA
  • r16916 uses Amiga charset for all Amiga formats
  • r16917 uses Amiga_no_C1 charset for all Amiga formats
  • r16915 uses ISO8859_1_no_C1 charset for GTK/GT2

TODO:

  • use Windows-1252 for MOD
manx

manx

2022-02-14 08:56

administrator   ~0005082

merged in 0.6 as r16930 / 0.6.2-pre.1

manx

manx

2022-02-16 15:25

administrator   ~0005094

I'll close this issue as fixed.

The only remaining issue is about how to achieve compatibility for .MOD with earlier OpenMPT versions, which is only vaguely related to this issue. However, the .MOD charset issue does not need to be resolved right now, because in the tracker, GetCharsetInternal() still just always returns mpt::Charset::Locale, which we did precisely for these compatibility concerns in the first place. I added a note in 0000569 so that we do not forget.

Issue History

Date Modified Username Field Change
2022-02-07 05:25 ProgrammerNerd New Issue
2022-02-07 05:44 ProgrammerNerd Note Added: 0005043
2022-02-07 05:45 ProgrammerNerd Note Added: 0005044
2022-02-07 08:02 Saga Musix Steps to Reproduce Updated
2022-02-07 08:02 Saga Musix Steps to Reproduce Updated
2022-02-07 09:15 Saga Musix Note Added: 0005045
2022-02-08 01:15 ProgrammerNerd Note Added: 0005048
2022-02-08 03:47 ProgrammerNerd Note Added: 0005049
2022-02-08 03:47 ProgrammerNerd File Added: protracker-screenshot.png
2022-02-08 07:35 manx Note Added: 0005050
2022-02-08 07:35 manx Assigned To => manx
2022-02-08 07:35 manx Status new => confirmed
2022-02-08 08:13 Saga Musix Note Added: 0005051
2022-02-08 09:08 manx Note Added: 0005052
2022-02-08 10:51 Saga Musix Note Added: 0005053
2022-02-08 12:28 manx Note Added: 0005054
2022-02-08 19:37 Saga Musix Note Added: 0005055
2022-02-08 19:39 manx Note Added: 0005056
2022-02-08 19:42 manx Note Added: 0005057
2022-02-08 20:04 manx Note Added: 0005058
2022-02-10 10:46 manx Note Added: 0005061
2022-02-10 10:49 manx Note Edited: 0005061
2022-02-10 11:55 manx Note Edited: 0005061
2022-02-10 15:08 manx Note Added: 0005062
2022-02-10 15:08 manx File Added: bug1565-implement-no-c1-charsets-v1.patch
2022-02-10 15:09 manx Note Edited: 0005061
2022-02-10 15:09 manx Note Edited: 0005062
2022-02-10 15:09 manx Note Edited: 0005062
2022-02-10 15:50 manx Note Added: 0005063
2022-02-10 16:05 manx Note Added: 0005064
2022-02-10 16:05 manx File Added: bug1565-implement-no-c1-charsets-v2.patch
2022-02-10 16:05 manx File Added: bug1565-use-amiga-no-c1-charset-v1.patch
2022-02-10 16:20 manx Note Edited: 0005064
2022-02-10 18:30 Saga Musix Note Added: 0005065
2022-02-10 18:47 manx Note Added: 0005066
2022-02-10 18:48 Saga Musix Note Added: 0005067
2022-02-10 18:52 manx Note Added: 0005068
2022-02-11 09:13 manx Relationship added related to 0001567
2022-02-11 09:52 manx Note Added: 0005071
2022-02-11 11:49 manx Target Version => OpenMPT 1.30.03.00 / libopenmpt 0.6.2 (upgrade first)
2022-02-14 08:56 manx Note Added: 0005082
2022-02-16 15:20 manx Relationship added related to 0000569
2022-02-16 15:25 manx Status confirmed => resolved
2022-02-16 15:25 manx Resolution open => fixed
2022-02-16 15:25 manx Fixed in Version => OpenMPT 1.30.03.00 / libopenmpt 0.6.2 (upgrade first)
2022-02-16 15:25 manx Note Added: 0005094