Character Conversion Alias Design

Draft 2001-07-12

Character Conversion is complicated by the facts that:

ICU attempts to untangle this morass in the following ways.

1. There will be a unique name for each mapping table on all ICU-supported platforms (390, 400, AIX, Solaris, Windows, HP/UX) plus on Java and the Mac, in the format given by UTR #22: Character Mapping Tables.

2. There will be APIs that supply information relating platform names (aliases) for character mappings to this canonical name.

Note: by an alias, we mean any name that is commonly used to refer to that character conversion mapping on that platform. For example, if I am working on AIX with HTTP, what CCSID do I use when I encounter charset="SJIS"? Similarly, what is the (preferred) alias that I would use when generating HTML or XML that is supposed to be in a particular CCSID? The tables that relate aliases to CCSIDs may be in the operating system, or they may be in major products like DB2.

A sketch of the APIs is presented below (in pseudocode, since we don't want to worry about the exact signatures yet):

Basic

Returns the canonical name, given the platform and alias. If the platform is null, then a default is returned.

String ucnv_getCanonicalName(String alias, String platform);

Returns an alias, given the canonical name and a platform. If the platform is null, then a default is returned.

String ucnv_getPreferredAlias(String canonicalName, String platform);

To guarantee interoperability, the default values are chosen to be the same across all platforms.
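To make the fallback behavior concrete, here is a minimal sketch of the two basic calls in Python. The table contents and the platform strings ("AIX") are toy data invented for illustration, the function names are Python stand-ins for the pseudocode above, and falling back to the default on an unknown platform is an assumption taken from the data structure described later in this note.

```python
# Toy data invented for illustration; these are not real ICU tables.
# (alias, platform) -> canonical name. Platform None holds the default.
CANONICAL = {
    ("hebrew", None): "iso-8859_8-1988",
    ("cp916", None): "iso-8859_8-1988",
    ("cp916", "AIX"): "iso-8859_8-1988",
}

# (canonical name, platform) -> preferred alias. Platform None holds the default.
PREFERRED = {
    ("iso-8859_8-1988", None): "iso-8859-8",
    ("iso-8859_8-1988", "AIX"): "ibm-916",
}


def get_canonical_name(alias, platform=None):
    """Return the canonical name for an alias, or None if the alias is unknown."""
    hit = CANONICAL.get((alias, platform))
    if hit is None:
        # No platform-specific entry: use the platform-independent default.
        hit = CANONICAL.get((alias, None))
    return hit


def get_preferred_alias(canonical_name, platform=None):
    """Return the preferred alias for a canonical name, or None if unknown."""
    hit = PREFERRED.get((canonical_name, platform))
    if hit is None:
        hit = PREFERRED.get((canonical_name, None))
    return hit
```

Because the defaults live under a single platform-independent key, a caller that passes no platform gets the same answer everywhere, which is the interoperability property described above.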

Listing

Returns the platforms for which ICU has mapping tables:

String[] ucnv_getPlatforms();

Returns a list of all the aliases that map to the given canonical name on the given platform. The first name is the preferred one.

String[] ucnv_getAliases(String canonicalName, String platform);

Note: for both basic and listing, the platform value might be an enum instead of a string.
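The listing calls could behave roughly as in the following Python sketch. The data is again invented for illustration, and returning an empty list for an unknown canonical name or platform is an assumption; this note does not say what should happen in that case.

```python
# Toy data invented for illustration; these are not real ICU tables.
# (canonical name, platform) -> aliases, with the preferred alias first.
ALIASES = {
    ("iso-8859_8-1988", "AIX"): ["ibm-916", "cp916"],
    ("iso-8859_8-1988", "Windows"): ["iso-8859-8", "hebrew"],
}


def get_platforms():
    """Return the platforms that appear in the table, in sorted order."""
    return sorted({platform for _, platform in ALIASES})


def get_aliases(canonical_name, platform):
    """Return a copy of the alias list; the first entry is the preferred one."""
    return list(ALIASES.get((canonical_name, platform), []))
```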

Data Structure

Logically, all of this could be supported by the following three mappings:

Main:                     alias, platform => canonicalName
DefaultForCanonicalName:  canonicalName => alias
DefaultForAlias:          alias => canonicalName

However, we can simplify the mapping considerably, by accepting a superset of names that are actually used on each platform. That way, we don't have to mark the names with the precise set of platforms that they are used on; only where it is important, such as where we need different aliases for different platforms.

Here are some thoughts on how we could pack this and still look up items quickly. This is a first pass -- we might come up with something cleverer or more compact after thinking about it.

Put the aliases into a sorted list. The index in that list now stands for the alias. Do the same for canonical names and platforms, so that all the mappings are expressed in terms of integers. The alias and canonical name indexes each fit in 2 bytes; the platform index fits in 1 byte. In fact, the platform uses only 7 bits of that byte; the top bit is a termination flag. A platform value of 0x7F stands for "Any".
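The name-to-integer step might look like the following Python sketch. The helper names (make_index_table, index_of) and the constant names are hypothetical; only the values 0x7F and 0x80 come from the description above. Binary search is one obvious way to exploit the sorted order, not necessarily the one we will end up using.

```python
import bisect

ANY_PLATFORM = 0x7F  # platform value that stands for "Any"
LAST_UNIT = 0x80     # top bit of the platform byte: termination flag


def make_index_table(names):
    """Return the sorted, de-duplicated list whose indexes stand for the names."""
    return sorted(set(names))


def index_of(table, name):
    """Binary-search the sorted table and return the index that stands for name."""
    i = bisect.bisect_left(table, name)
    if i == len(table) or table[i] != name:
        raise KeyError(name)
    return i
```

With the aliases "hebrew", "cp916" and "iso-8859-8", the sorted table is ["cp916", "hebrew", "iso-8859-8"], so "hebrew" is represented by the integer 1.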

The main table is a list of 2-byte offsets, one for each alias. Each offset points to a list of three-byte units. In each unit, the first two bytes are the canonical name, and the last byte is the platform. The last unit is indicated by a set 0x80 bit in the platform byte. The first unit is the default canonical name for that alias, and is also the default in case there is no match for the platform. There is at most one unit for each platform in the list.
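As a rough illustration of the main table, the following Python sketch packs the unit lists and looks one up. Several details are assumptions that this note does not settle: big-endian byte order, offsets counted in bytes from the start of the unit area, at least one unit per alias, and a unit tagged "Any" being reached only as the first (default) unit. The function names are hypothetical.

```python
import struct

ANY_PLATFORM = 0x7F
LAST_UNIT = 0x80


def pack_main(unit_lists):
    """Pack one list of (canonical_index, platform) units per alias index.

    Returns (offset_table, unit_area). The first unit of each list is the default.
    """
    offsets = []
    body = bytearray()
    for units in unit_lists:
        offsets.append(len(body))
        for n, (cname, platform) in enumerate(units):
            flag = LAST_UNIT if n == len(units) - 1 else 0
            body += struct.pack(">HB", cname, platform | flag)
    header = struct.pack(">%dH" % len(offsets), *offsets)
    return header, bytes(body)


def lookup_main(header, body, alias_index, platform):
    """Return the canonical index for (alias, platform), else the alias default."""
    (pos,) = struct.unpack_from(">H", header, 2 * alias_index)
    default = None
    while True:
        cname, pbyte = struct.unpack_from(">HB", body, pos)
        if default is None:
            default = cname          # the first unit is the default
        if (pbyte & 0x7F) == platform:
            return cname             # exact platform match
        if pbyte & LAST_UNIT:
            return default           # end of list and no match
        pos += 3
```

Two-byte offsets limit the unit area to 64K bytes in this sketch; whether that is enough depends on how many aliases need platform-specific units.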

The reverse table is similar, except that it starts with a list of 2-byte offsets, one for each canonical name. Each offset points to a list of three-byte units. In each unit, the first two bytes are the alias, and the last byte is the platform. The last unit is indicated by a set 0x80 bit in the platform byte. The first unit is the default alias for that canonical name, and is also the default in case there is no match for the platform. There may be more than one unit for each platform in the list. The first unit with a matching platform is the default alias for that platform.
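Reading the reverse table could be sketched as follows in Python. The byte order (big-endian) and the function names are assumptions, and the sample bytes in the usage note are made up. The sketch decodes one unit list, then answers the two questions the reverse table is for: the preferred alias on a platform, and all aliases on a platform.

```python
import struct


def read_units(body, pos):
    """Decode (alias_index, platform) units from pos up to the unit flagged 0x80."""
    units = []
    while True:
        alias, pbyte = struct.unpack_from(">HB", body, pos)
        units.append((alias, pbyte & 0x7F))
        if pbyte & 0x80:
            return units
        pos += 3


def preferred_alias(units, platform):
    """First unit with a matching platform, else the first unit overall."""
    for alias, unit_platform in units:
        if unit_platform == platform:
            return alias
    return units[0][0]


def aliases_for(units, platform):
    """All alias indexes tagged with the platform, in table order."""
    return [alias for alias, unit_platform in units if unit_platform == platform]
```

For example, the nine bytes 00 04 7F, 00 01 02, 00 03 82 decode to three units: alias 4 for "Any", then aliases 1 and 3 for platform 2. Alias 1 is then the preferred alias on platform 2, and alias 4 is the fallback for every other platform.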

Resource Bundle Format

The current format is the following:

http://oss.software.ibm.com/developerworks/opensource/cvs/icu/data/convrtrs.txt?rev=1.81

This format is almost sufficient to get what we need; it basically corresponds to the reverse table above. We would need the following changes.

  1. Change the "Actual file name" to be the canonical name. Where we need to, add the old one as an alias.
  2. Add a tag for each platform.
  3. The default alias (for the given canonical name) can work as above.
  4. We add one piece of syntax for marking the default canonical name for a given alias. I suggest the following. Given an alias name A:
    a. If there is a * following A on a line, then the canonical name for that line is the default for A.
    b. Otherwise, if there is a * following a canonical name, and that line contains A, then that canonical name is the default for A.
    c. The builder code throws an error:
      • If there are no lines satisfying either (a) or (b) for A.
      • If there are two different lines satisfying (a) for A.
  5. The builder code throws an error if two different canonical names have the same <alias, platform>.
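The default-marking rules in item 4 can be sketched in Python as follows. This works on already-parsed lines rather than on the real file syntax: each line is a hypothetical tuple (canonical name, whether the canonical name is starred, list of (alias, whether the alias is starred)). When several lines satisfy only the second rule, the sketch takes the first one; that choice is an assumption, since the rules above do not cover it. The name "x-example" in the usage note is made up.

```python
def default_canonicals(lines):
    """Map each alias to its default canonical name, following the * rules.

    lines: list of (cname, cname_starred, [(alias, alias_starred), ...]).
    Raises ValueError for the two error cases listed in item 4.
    """
    hits_by_alias = {}
    for cname, cname_starred, aliases in lines:
        for alias, alias_starred in aliases:
            hits_by_alias.setdefault(alias, []).append(
                (cname, cname_starred, alias_starred))

    defaults = {}
    for alias, hits in hits_by_alias.items():
        # Lines where the alias itself is starred.
        starred_alias = [c for c, _, a_star in hits if a_star]
        # Lines where the canonical name is starred and the alias is present.
        starred_cname = [c for c, c_star, _ in hits if c_star]
        if len(starred_alias) > 1:
            raise ValueError("alias %r is starred on more than one line" % alias)
        if starred_alias:
            defaults[alias] = starred_alias[0]
        elif starred_cname:
            defaults[alias] = starred_cname[0]  # assumption: first such line wins
        else:
            raise ValueError("no default canonical name for alias %r" % alias)
    return defaults
```

Given one line for iso-8859_8-1988* that lists hebrew and cp916, and a second line for x-example that lists hebrew*, the default for cp916 is iso-8859_8-1988 and the default for hebrew is x-example, because a star on the alias takes precedence over a star on the canonical name.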

Example:

old

CName Aliases
ibm-916 iso-8859-8 { MIME } hebrew cp916 8859-8 csisolatinhebrew iso-ir-138 ISO_8859-8:1988 { IANA } 916

new

CName Aliases
iso-8859_8-1988* iso-8859-8 { MIME } hebrew cp916 8859-8 csisolatinhebrew iso-ir-138 ISO_8859-8:1988 { IANA } 916 ibm-916

The reverse table is easy to build from this. As it is being built, the mappings for the main table can be collected and then organized and written out.
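Collecting the main-table mappings while walking the reverse data might look like this Python sketch. The input shape (a dict from canonical name to a list of (alias, platform) pairs), the platform strings, and the name "x-example" are hypothetical. The duplicate check corresponds to the builder error for two canonical names sharing the same <alias, platform>. Ordering each alias's units so that its default comes first is left out.

```python
def collect_main(reverse):
    """Invert {cname: [(alias, platform), ...]} into {alias: [(cname, platform), ...]}.

    Raises ValueError if two different canonical names claim the same
    (alias, platform) pair.
    """
    main = {}
    owner = {}
    for cname in sorted(reverse):
        for alias, platform in reverse[cname]:
            first = owner.setdefault((alias, platform), cname)
            if first != cname:
                raise ValueError("%r on %r maps to both %r and %r"
                                 % (alias, platform, first, cname))
            main.setdefault(alias, []).append((cname, platform))
    return main
```

Walking the canonical names in sorted order keeps the output deterministic from one build to the next.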

BTW, we should move the converter data files to a subfolder of /data/ so they are not interleaved with other files. Also, I think readability would be improved if we allowed commas in the lists, e.g.