To: ietf-charsets@iana.org
From: Mark Davis, IBM
Re: Charset Identity and Registration
Date: 2003-05-22
Latest Version: http://source.icu-project.org/repos/icu/icuhtml/trunk/design/charset_questions.html

We have a number of questions about the application of RFC 2978 that bear on which charsets IBM should register. We illustrate the issues with a list of named examples. Although each example is reduced to a handful of mappings for simplicity, together they call out the major issues that need clarification.

Notation

Each example consists of a mapping table for a possible charset. The octets are on the left, with the corresponding Unicode/10646 code points on the right. The arrows show where there is a mapping, using the following convention:

Notation   Description (where B is an octet sequence and U is a Unicode/10646 character sequence)
B <=> U    A roundtrip mapping: B maps to U, and U maps back to B
B <=  U    A fallback mapping: B does not map to U, but U maps to B
B  => U    A reverse-fallback mapping: B maps to U, but U does not map back to B

Examples

  1. Base
    0x61 <=> U+0061

  2. Base_Fallback
    0x61 <=> U+0061
    0x61 <=  U+FFF1

  3. Base_Reverse_Fallback
    0x61 <=> U+0061
    0xF1  => U+0061

  4. Base_Roundtrip1
    0x61 <=> U+0061
    0xF1 <=> U+FFF1

  5. Base_Roundtrip2
    0x61 <=> U+0061
    0xF1 <=> U+00F1
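
To make the notation concrete, the five examples can be written down as data, with the decoding and encoding directions derived from the arrows. This is only a minimal sketch in Python; the names TABLES, decoder, and encoder are hypothetical and do not come from ICU or any other library.

```python
# Hypothetical representation: each table is a list of
# (octet, arrow, code point) rows, using the arrows defined under Notation.
TABLES = {
    "Base": [(0x61, "<=>", 0x0061)],
    "Base_Fallback": [(0x61, "<=>", 0x0061), (0x61, "<=", 0xFFF1)],
    "Base_Reverse_Fallback": [(0x61, "<=>", 0x0061), (0xF1, "=>", 0x0061)],
    "Base_Roundtrip1": [(0x61, "<=>", 0x0061), (0xF1, "<=>", 0xFFF1)],
    "Base_Roundtrip2": [(0x61, "<=>", 0x0061), (0xF1, "<=>", 0x00F1)],
}


def decoder(table):
    """Octet -> code point: roundtrip and reverse-fallback rows apply."""
    return {b: u for b, arrow, u in table if arrow in ("<=>", "=>")}


def encoder(table):
    """Code point -> octet: roundtrip and fallback rows apply."""
    return {u: b for b, arrow, u in table if arrow in ("<=>", "<=")}
```

In this sketch a fallback row shows up only in the encoder and a reverse-fallback row only in the decoder, which is the asymmetry the questions below turn on.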

These examples stand in for a great many real character sets. The ICU Character Set Repository contains many cases where two charsets differ only by additional fallbacks, reverse fallbacks, roundtrips, or some combination of these. There are other examples at XML Japanese Profile and many more in IBM's database of code page information.

With these examples in mind, here are the questions:

Q1. If charset Base is already registered, which of the others is a possible candidate for registration?

 In the RFC, the operative language appears to be the following:

  A. "The term "charset" (referred to as a "character set" in previous versions of this document) is used here to refer to a method of converting a sequence of octets into a sequence of characters."
  B. "A charset should therefore be registered ONLY if it adds significant functionality that is valuable to a large community, OR if it documents existing practice in a large community."
  C. "Inclusion of a mapping to ISO 10646 is now recommended for all registered charsets."

From (A), a charset can be thought of as a set of mappings from octets to characters. On that reading, Base_Reverse_Fallback, Base_Roundtrip1, and Base_Roundtrip2 are definitely not the same charset as Base, because they map different sequences of octets. Thus, if they met criterion (B), each would be a candidate for registration. The text leaves the situation much less clear for Base_Fallback.
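
The reading above can be checked mechanically. The sketch below assumes that "a method of converting a sequence of octets into a sequence of characters" means only the octet-to-character direction, so it lists just the roundtrip and reverse-fallback rows of each example. The name DECODERS is illustrative.

```python
# Octet -> code point view of each example. Fallback rows are omitted
# because they apply only when converting characters to octets.
DECODERS = {
    "Base": {0x61: 0x0061},
    "Base_Fallback": {0x61: 0x0061},
    "Base_Reverse_Fallback": {0x61: 0x0061, 0xF1: 0x0061},
    "Base_Roundtrip1": {0x61: 0x0061, 0xF1: 0xFFF1},
    "Base_Roundtrip2": {0x61: 0x0061, 0xF1: 0x00F1},
}

# Tables whose octet -> character behavior is indistinguishable from Base.
same_as_base = [
    name for name, table in DECODERS.items() if table == DECODERS["Base"]
]
print(same_as_base)  # ['Base', 'Base_Fallback']
```

Under this assumption Base_Fallback converts octets exactly as Base does, and the two differ only in the character-to-octet direction, which the quoted definition does not address.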

Q2. If charset Base_Roundtrip1 is registered, which of the others is a candidate for registration?

It would appear that Base_Roundtrip2 is clearly not the same charset, and thus if it satisfied (B) it would be a candidate. It is unclear, however, whether Base would be, since it is a strict subset of a registered charset. We also need confirmation as to the others.

Q3. How does the RFC handle such additions over time?

This was very unclear to us. Assume again that Base is registered. There are cases where someone adds mappings to Base (ending up with something like Base_Roundtrip1) but does not register the result as a new charset. For example, Microsoft added the Euro sign to windows-1252. Does this require a new registration or not? This also relates to the issue of stability, below.

Q4. Does the RFC require any sort of stability?

For example, the windows-1252 registration supplies a mapping by pointing to Windows Code Page 1252, but the contents of that page may change over time without notice. Consulting the link now and a year from now might give different answers, and one cannot tell whether or when changes were made in the past. (This is not a criticism of Microsoft; this happens in many other cases.) It appears that there is no requirement for stability in the RFC. Is this true?

Note that when Microsoft added a mapping for Euro, they actually (e.g. in terms of the results of their API) changed the mapping from 0x80 <=> U+0080, to 0x80 <=> U+20AC. So on an API level, it was not an addition, but a change of mapping, like from Base_Roundtrip1 to Base_Roundtrip2. However, since the registration didn't actually supply a mapping for 0x80, formally speaking it was an addition, like from Base to Base_Roundtrip2.
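
The API-level behavior can be observed in any converter that ships its own table. As an illustration only, the sketch below uses Python's bundled "cp1252" and "latin-1" codecs; these reflect that implementation's tables and say nothing about what the IANA registration itself specifies.

```python
# Python's "cp1252" codec is one implementation's table,
# not the IANA registration for windows-1252.
euro = b"\x80".decode("cp1252")
print(f"U+{ord(euro):04X}")  # U+20AC

# For contrast, ISO-8859-1 ("latin-1") maps the same octet
# to the C1 control code point U+0080.
c1 = b"\x80".decode("latin-1")
print(f"U+{ord(c1):04X}")  # U+0080
```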

Q5. Can mapping tables be added to older registrations?

Many of the existing registrations do not have mapping tables supplied with them, and in many of those cases it is difficult to get the original document that defined the mapping. Should (or could) the original applicants supply mapping tables for those? For example, should (or could) IBM supply mapping tables for the "ibm-x" charsets that are already in the table, but do not have mappings to and from Unicode/10646?

Q6. It would appear to us that there are two possible bright-line rules. Which is the closest to IANA policy?

R1. Two mapping tables get the same name if and only if all of their roundtrip mappings are identical.

R2. Two mapping tables get the same name if and only if all of their mappings (roundtrip, fallback, and reverse fallback) are the same.

R2 is more complete, in the sense that any use of two mapping tables with the same IANA name will have identical results. However, it appears that this policy would not be consistent with past practice. Thus we are guessing that R1 would be the best bright-line rule. It would at least guarantee that any mapping tables with the same IANA name, if used with fallbacks and reverse fallbacks turned off, would have identical results.
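
The two rules can be stated as predicates over mapping tables. The following is a sketch under the assumption that a table is a list of (octet, arrow, code point) rows using the arrows from the Notation section; the function names are hypothetical.

```python
def roundtrips(table):
    """The set of (octet, code point) pairs marked as roundtrip."""
    return {(b, u) for b, arrow, u in table if arrow == "<=>"}


def same_name_r1(t1, t2):
    """R1: same name iff all roundtrip mappings are identical."""
    return roundtrips(t1) == roundtrips(t2)


def same_name_r2(t1, t2):
    """R2: same name iff all mappings of every kind are identical."""
    return set(t1) == set(t2)


base = [(0x61, "<=>", 0x0061)]
base_fallback = base + [(0x61, "<=", 0xFFF1)]
base_roundtrip1 = base + [(0xF1, "<=>", 0xFFF1)]

print(same_name_r1(base, base_fallback))    # True
print(same_name_r2(base, base_fallback))    # False
print(same_name_r1(base, base_roundtrip1))  # False
```

Under R1, Base and Base_Fallback would share a name; under R2 they would not. Both rules give Base_Roundtrip1 a different name from Base.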
