Draft 2002-09-10
Character Conversion is complicated by the facts that:
ICU will attempt to untangle this problem in the following ways.
These are the goals of this design document.
These are the assumptions for this design.
Logically, all of this could be supported by the following three mappings:
Main: alias, platform => canonicalName
DefaultForCanonicalName: canonicalName => alias
DefaultForAlias: alias => canonicalName
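As a rough illustration of the three logical mappings, the following C sketch models each one as a small lookup table. The struct names, the function names (lookup_main, default_for_canonical_name, default_for_alias) and the table rows are hypothetical and exist only for this example. The rows are loosely based on the Shift_JIS samples in this document, and the plain strcmp() comparison is a simplification.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical row types for the three logical mappings. */
typedef struct { const char *alias, *platform, *canonicalName; } MainRow;
typedef struct { const char *key, *value; } DefaultRow;

/* Main: (alias, platform) => canonicalName. Illustrative data only. */
static const MainRow kMain[] = {
    { "Shift_JIS", "DB2",     "ibm-943_P130" },
    { "Shift_JIS", "MIME",    "ibm-943_P14A" },
    { "Shift_JIS", "WINDOWS", "ibm-943_P14A" },
    { "cp943",     "JAVA",    "ibm-943_P130" },
};

/* DefaultForCanonicalName: canonicalName => alias. Illustrative data only. */
static const DefaultRow kDefaultForCanonicalName[] = {
    { "ibm-943_P130", "ibm-943" },
    { "ibm-943_P14A", "Shift_JIS" },
};

/* DefaultForAlias: alias => canonicalName. Illustrative data only. */
static const DefaultRow kDefaultForAlias[] = {
    { "Shift_JIS", "ibm-943_P14A" },
    { "cp943",     "ibm-943_P130" },
};

/* Returns the canonical name for an alias on a platform, or NULL if unknown. */
const char *lookup_main(const char *alias, const char *platform) {
    for (size_t i = 0; i < sizeof kMain / sizeof kMain[0]; ++i) {
        if (strcmp(kMain[i].alias, alias) == 0 &&
            strcmp(kMain[i].platform, platform) == 0) {
            return kMain[i].canonicalName;
        }
    }
    return NULL;
}

/* Shared linear search for the two single-key mappings. */
static const char *lookup_default(const DefaultRow *rows, size_t count, const char *key) {
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(rows[i].key, key) == 0) {
            return rows[i].value;
        }
    }
    return NULL;
}

const char *default_for_canonical_name(const char *canonicalName) {
    return lookup_default(kDefaultForCanonicalName,
                          sizeof kDefaultForCanonicalName / sizeof kDefaultForCanonicalName[0],
                          canonicalName);
}

const char *default_for_alias(const char *alias) {
    return lookup_default(kDefaultForAlias,
                          sizeof kDefaultForAlias / sizeof kDefaultForAlias[0],
                          alias);
}
```

In this sketch the same alias can resolve to different converters depending on the platform, while the two default mappings give an answer when no platform is supplied.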
The canonical alias from UTR #22 has some drawbacks. Even though it is an excellent format for distinguishing an alias internally, no other vendor supports the names. We should make it easy for our users to get an IANA, MIME, or other platform- or version-specific name. UTR #22 also has the drawback that the format can conflict with existing aliases, and our users usually do not have an accurate way to distinguish between a canonical name and an alias (e.g. iso-8859-1 vs. aix-iso8859_1-4.3.6).
Putting features of a codepage into a converter name (e.g. VASCII, VPUA, VSUB, VNLLF) may be useful for internally distinguishing the codepage features. However, our users generally do not know that these differences exist, and few other platforms recognize these specially made-up names. Our users usually know the platform and the alias (or preferred alias), but they don't know the features.
Using the feature list in a converter name to find the appropriate fallback codepage is rather difficult. Which one do you fall back to first when a converter can't be opened? Do you want one that has the Euro update, or one that works on a certain platform with VASCII? Our converter alias design can probably address this issue by allowing the user to iterate over the aliases based on a standard or platform name and select the right one depending on the features or the version in the canonical name.
When a converter can't be opened, you probably want to try to open another one that worked on a given platform at one time. If the fallback path is wrong for a user, we can allow the user to modify the alias table before build time, allow the user to programmatically modify the behavior, or allow the user to programmatically query the fallback path and decide on the fly.
The current format is the following:
http://source.icu-project.org/repos/icu/icu/trunk/source/data/mappings/convrtrs.txt
The following examples show how a new converter alias table could be used. While I've tried to make them complete, they are not guaranteed to be accurate. Treat the examples as illustrations of the new format.
These aliases with the tags are going to get really long. We should consider allowing line wrapping. This could be done in a similar way that Makefiles do line wrapping. If the line starts with whitespace, then it must be a continuation of the previous line.
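The continuation rule could be sketched in C roughly as follows. This is only an illustration: the function name fold_lines, the fixed MAX_LOGICAL_LINE buffer size, and the choice to join pieces with a single space are assumptions made for the example, not part of the proposed format.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_LOGICAL_LINE 512  /* arbitrary limit chosen for this sketch */

/* A physical line continues the previous one when it begins with a space or tab. */
static int is_continuation(const char *line) {
    return line[0] == ' ' || line[0] == '\t';
}

/*
 * Folds physical lines into logical lines. A line that starts with
 * whitespace is appended to the previous logical line.
 * Returns the number of logical lines written to out.
 */
size_t fold_lines(const char *const *lines, size_t lineCount,
                  char out[][MAX_LOGICAL_LINE], size_t outCapacity) {
    size_t used = 0;
    for (size_t i = 0; i < lineCount; ++i) {
        const char *text = lines[i];
        if (is_continuation(text) && used > 0) {
            /* Drop the leading whitespace and join with one space. */
            size_t len = strlen(out[used - 1]);
            text += strspn(text, " \t");
            snprintf(out[used - 1] + len, MAX_LOGICAL_LINE - len, " %s", text);
        } else {
            if (used == outCapacity) {
                break;  /* no room for another logical line */
            }
            snprintf(out[used++], MAX_LOGICAL_LINE, "%s", text);
        }
    }
    return used;
}
```

With this approach the rest of a parser only ever sees one logical line per converter, however the entry was wrapped in the file.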
One possible alias table format can look like the following in BNF. For the sake of simplicity, the comments, newlines and makefile style whitespace usage are left out of this format.
AliasTable ::= '{' SupportedPlatform* '}'
SupportedPlatform ::= [a-zA-Z_]+
ConverterAliases ::= CanonicalConverterName Tags* Aliases*
CanonicalConverterName ::= [a-zA-Z0-9_]+'-'[a-zA-Z0-9_]+'-'[a-zA-Z0-9_]+
Tags ::= '{' Tag+ '}'
Tag ::= SupportedPlatform AlternatePlatformAlias?
DefaultPlatformAlias ::= '*'
Aliases ::= Alias Tags*
Alias ::= [a-zA-Z_'-']+
The asterisk "*" is used for denoting the default alias. This may seem a little odd, since some people may think 'zero or more' when looking at this BNF, but here it is a literal character. Marking the default with a single character is much quicker and easier than any alternative. We are also limited to using only invariant characters in this table.
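To make the literal use of the asterisk concrete, here is a minimal C sketch that splits one tag token, such as IANA*, into a tag name and a default flag. The function name parse_tag and its signature are hypothetical. Digits are accepted in tag names as an assumption, because tags such as DB2 and OS400 appear in the sample tables.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Splits one tag token such as "IANA*" into its name and default flag.
 * Returns 1 on success, 0 if the token is empty, too long, or malformed.
 */
int parse_tag(const char *token, char *name, size_t nameCapacity, int *isDefault) {
    size_t len = strlen(token);
    *isDefault = 0;
    /* A trailing literal '*' marks the default alias; it is not a repetition operator. */
    if (len > 0 && token[len - 1] == '*') {
        *isDefault = 1;
        --len;
    }
    if (len == 0 || len >= nameCapacity) {
        return 0;
    }
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)token[i];
        /* Assumption: letters, digits, and underscore are valid in a tag name. */
        if (!isalnum(c) && c != '_') {
            return 0;
        }
        name[i] = (char)c;
    }
    name[len] = '\0';
    return 1;
}
```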
We could do this table as a resource bundle, but this data can get very large when alias versioning is considered, so we should optimize the data format as much as possible. We could also do this table in XML, but it may be difficult to include XML in our builds. We can consider exporting it as XML, which would be very useful when combined with XSL to generate HTML for public viewing.
Old alias table
ibm-916 iso-8859-8 { MIME } hebrew cp916 8859-8 csisolatinhebrew iso-ir-138 ISO_8859-8:1988 { IANA } 916
# japanese. Unicode name is \u30b7\u30d5\u30c8\u7b26\u53f7\u5316\u8868\u73fe
# Iana says that Windows-31J is an extension to csshiftjis
ibm-943_P130-2000 ibm-943_VASCII_VSUB_VPUA ibm-943
ibm-943_P14A-2000 ibm-943_VSUB_VPUA Shift_JIS { MIME } csWindows31J sjis cp943 cp932 ms_kanji csshiftjis windows-31j x-sjis 943
ibm-942_P120-2000 ibm-942_VASCII_VSUB_VPUA ibm-942 ibm-932 ibm-932_VASCII_VSUB_VPUA
ibm-942_P12A-2000 ibm-942_VSUB_VPUA shift_jis78 sjis78 pck ibm-932_VSUB_VPUA
New alias table
ibm-916 { IBM* } ibm-916_P100-1987 { UTR22 }
    iso-8859-8 { MIME* IANA WINDOWS }
    ISO8859_8 { AIX* } # duplicate of previous alias in a slightly different form
    hebrew { IANA } # This is one of the supported aliases from IANA.
    cp916 { JAVA* }
    28598 { WINDOWS* }
    8859-8
    csisolatinhebrew { IANA } iso-ir-138 { IANA } # You can have more than one alias per line.
    ISO_8859-8:1988 { IANA* } # This is the default alias for IANA.
    916 { ZOS* OS400* DB2* IBM }
# japanese. Unicode name is \u30b7\u30d5\u30c8\u7b26\u53f7\u5316\u8868\u73fe
# Iana says that Windows-31J is an extension to csshiftjis
ibm-943_P130 { UTR22* } ibm-943_VASCII_VSUB_VPUA { ICU_FEATURE* } ibm-943 { ZOS* OS400* DB2* IBM }
    cp943 { JAVA* } Shift_JIS { MIME IANA DB2* }
ibm-943_P14A { UTR22* } ibm-943_VSUB_VPUA { ICU_FEATURE* } Shift_JIS { MIME* IANA* WINDOWS* DB2 } csWindows31J sjis
    cp943 cp943C { JAVA* } # Java uses the C variant only
    cp932 { ICU } 932 { WINDOWS* } 943 { IBM }
    ms_kanji {IANA} csshiftjis {IANA} windows-31j x-sjis
ibm-942_P120
    ibm-942_VASCII_VSUB_VPUA
    ibm-942
    ibm-932
    ibm-932_VASCII_VSUB_VPUA
ibm-942_P12A
    ibm-942_VSUB_VPUA
    shift_jis78
    sjis78
    pck { SOLARIS* }
    ibm-932_VSUB_VPUA
Notice that there are two Shift_JIS aliases, but only one of them is the default for a given tag.
There should be one and only one default tag of a given type per line and per alias. So you can't have WINDOWS be the default for Shift_JIS on two different converters, and you can't have more than one default alias of a given type for a converter. So the following two examples are illegal.
ibm-942_P12A
    sjis78 { SOLARIS }
    pck { SOLARIS }
The previous example can be fixed by properly setting the defaults for the aliases to the following:
ibm-942_P12A
    sjis78 { SOLARIS* }
    pck { SOLARIS }
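One possible reading of the default rule, that every tag used on a converter must have exactly one default alias, can be checked with a few lines of C. This is a sketch under that assumption only. The TaggedAlias struct and the has_single_default function are hypothetical names, and the cross-converter half of the rule (the same alias may be the default for a tag on only one converter) is not covered here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened form of one "alias { TAG }" or "alias { TAG* }" entry. */
typedef struct {
    const char *alias;
    const char *tag;
    int isDefault;  /* nonzero when the tag was written with a trailing '*' */
} TaggedAlias;

/*
 * Checks the aliases of a single converter for one tag.
 * Returns 1 if the tag is unused or has exactly one default alias, else 0.
 */
int has_single_default(const TaggedAlias *aliases, size_t count, const char *tag) {
    size_t tagged = 0;
    size_t defaults = 0;
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(aliases[i].tag, tag) == 0) {
            ++tagged;
            if (aliases[i].isDefault) {
                ++defaults;
            }
        }
    }
    return tagged == 0 || defaults == 1;
}
```

A table compiler could run such a check per converter and per tag, and report an error instead of silently picking a default.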
If we allowed alias versioning, we might be able to have the same standard and alias on different mappings tables, like the following:
ibm-942_P120
    pck { SOLARIS/1* }
ibm-942_P12A
    pck { SOLARIS/2* }
Having multiple aliases for a converter and having multiple versions of an alias are two orthogonal ideas, so they need to be represented with different syntax. Having versioned aliases is useful, but most people want the current mapping. For instance, most people want the current mapping of windows-1252 with Euro support. This feature would be useful, but it may be a low priority. Versioning would also make the internal data structure a 4-dimensional array (converters x standards x aliases x versions).
In order to reduce the chance of a misspelling we should consider requiring a list of supported tags at the beginning of the file. These tags should probably be case insensitive. This list can also be used to specify the preference for opening a converter when there are multiple aliases and no standard is specified (e.g. ucnv_open()).
Three tags of interest are the ICU, ICU_FEATURE and ICU_CANONICAL tags. The ICU tags are names that ICU made up or misused and are needed for historical reasons, like the "cp*" tags for Windows, which are different from Java. The ICU_FEATURE tag is for converter aliases like ibm-942_VASCII_VSUB_VPUA. The ICU_CANONICAL tag is for aliases like ibm-916_P100-1987 and conforms to UTR #22.
# The following is the list of recognized tags, which must be the first uncommented line.
{ IANA MIME
IBM AIX DB2
ICU ICU_FEATURE ICU_CANONICAL
JAVA
WINDOWS MSIE # MSIE is Internet Explorer, which is different from Windows
NETSCAPE # Data not available at this time.
SOLARIS
GLIBC
APPLE
HPUX
ZOS ZOS_USS # Could be OS390 and OS390_USS instead
OS400
VMS # Source of information doesn't exist aka OpenVMS from Compaq
TRU64 # Source of information doesn't exist aka OSF1 from Compaq
IRIX # Source of information doesn't exist
SCO # Source of information doesn't exist
PTX # Source of information doesn't exist
PALMOS # Source of information doesn't exist
# We could add LINUX and BSD too, but they use GLIBC
}
UTF-8 { MIME } ibm-1208 { IBM } cp1208 { JAVA }
# ....
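The tag list can double as a preference order if the position of a tag in the list is used as its rank. The following C sketch shows one way to look up a tag case-insensitively and return its position. The names kTags, equals_ignore_case, and tag_index are hypothetical, and the array holds only the first few tags from the sample list.

```c
#include <ctype.h>
#include <stddef.h>

/* A subset of the sample tag list, in file order. */
static const char *const kTags[] = {
    "IANA", "MIME", "IBM", "AIX", "DB2",
    "ICU", "ICU_FEATURE", "ICU_CANONICAL",
    "JAVA", "WINDOWS", "MSIE", "NETSCAPE", "SOLARIS"
};

/* ASCII-only comparison; the table is limited to invariant characters. */
static int equals_ignore_case(const char *a, const char *b) {
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        ++a;
        ++b;
    }
    return *a == *b;  /* both strings must end together */
}

/*
 * Returns the position of a tag in the list, or -1 if the tag is not
 * recognized. A lower index means a higher preference.
 */
int tag_index(const char *tag) {
    for (size_t i = 0; i < sizeof kTags / sizeof kTags[0]; ++i) {
        if (equals_ignore_case(kTags[i], tag)) {
            return (int)i;
        }
    }
    return -1;
}
```

Rejecting any tag for which tag_index() returns -1 is what would catch a misspelled tag while the table is being compiled.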
Due to existing APIs, the platforms are called "standards" in the APIs. Since the list contains standards organizations such as IANA and MIME as well as platform vendors and software products, "standard" seems to be a reasonable name within our API.
The ucnv_getPlatform() function only works on open converters, and only returns UCNV_IBM if the codepage is IBM based. This API is inflexible due to its use of enums. It prevents our users from easily adding their own platforms/standards. Since this API ignores the fact that the same converter can be based on a list of platforms and standards, like UTF-8 and iso-8859-1, this API seems less than useful and should be considered for API deprecation.
Slightly off topic, there is a function called ucnv_getDisplayName(). While it seems like a useful function, it also has no data. It just returns the canonical name like some other APIs. We should consider deprecating it.
There is already an API to get the standard's converter name based on an alias and a standard. With a tag like "ICU_CANONICAL", you can also request the canonical name.
/**
 * Returns a standard name for a given converter name.
 *
 * @param alias original converter name
 * @param standard name of the standard governing the names; MIME and IANA
 *        are such standards
 * @param pErrorCode result of operation
 * @return returns the standard converter name;
 *         if a standard converter name cannot be determined,
 *         then NULL is returned. Owned by the library.
 * @stable
 */
U_CFUNC const char * U_EXPORT2
ucnv_getStandardName(const char *alias, const char *standard, UErrorCode *pErrorCode);
There is already an API to get the list of supported platforms. Its name is a little funny, but it already exists. The current API looks like the following:
/**
 * Gives the number of standards associated to converter names.
 * @return number of standards
 * @stable
 */
U_CAPI uint16_t U_EXPORT2
ucnv_countStandards(void);

/**
 * Gives the name of the standard at given index of standard list.
 * @param n index in standard list
 * @param pErrorCode result of operation
 * @return returns the name of the standard at given index. Owned by the library.
 * @stable
 */
U_CAPI const char * U_EXPORT2
ucnv_getStandard(uint16_t n, UErrorCode *pErrorCode);
There is already an API to get the list of installed converters, but we may want a new API to get the list of known converters. The current API looks like the following:
/**
 * Returns the number of available converters, as per the alias file.
 *
 * @return the number of available converters
 * @see ucnv_getAvailableName
 * @stable
 */
U_CAPI int32_t U_EXPORT2
ucnv_countAvailable(void);

/**
 * Gets the name of the specified converter from a list of all converters
 * contained in the alias file.
 * @param n the index to a converter available on the system (in the range [0..ucnv_countAvailable()])
 * @return a pointer to a string (library owned), or NULL if the index is out of bounds.
 * @see ucnv_countAvailable
 * @stable
 */
U_CAPI const char* U_EXPORT2
ucnv_getAvailableName(int32_t n);
There is already an API to get the list of aliases of a converter. The current API looks like the following:
/**
 * Gives the number of aliases for a given converter or alias name.
 * If the alias is ambiguous, then the preferred converter is used
 * and the status is set to U_AMBIGUOUS_ALIAS_WARNING.
 * This method only enumerates the listed entries in the alias file.
 * @param alias alias name
 * @param pErrorCode error status
 * @return number of names on alias list for given alias
 * @stable
 */
U_CAPI uint16_t U_EXPORT2
ucnv_countAliases(const char *alias, UErrorCode *pErrorCode);

/**
 * Gives the name of the alias at given index of alias list.
 * This method only enumerates the listed entries in the alias file.
 * If the alias is ambiguous, then the preferred converter is used
 * and the status is set to U_AMBIGUOUS_ALIAS_WARNING.
 * @param alias alias name
 * @param n index in alias list
 * @param pErrorCode result of operation
 * @return returns the name of the alias at given index
 * @see ucnv_countAliases
 * @stable
 */
U_CAPI const char * U_EXPORT2
ucnv_getAlias(const char *alias, uint16_t n, UErrorCode *pErrorCode);

/**
 * Fill-up the list of alias names for the given alias.
 * This method only enumerates the listed entries in the alias file.
 * If the alias is ambiguous, then the preferred converter is used
 * and the status is set to U_AMBIGUOUS_ALIAS_WARNING.
 * @param alias alias name
 * @param aliases fill-in list, aliases is a pointer to an array of
 *        ucnv_countAliases() string-pointers
 *        (const char *) that will be filled in.
 *        The strings themselves are owned by the library.
 * @param pErrorCode result of operation
 * @stable
 */
U_CAPI void U_EXPORT2
ucnv_getAliases(const char *alias, const char **aliases, UErrorCode *pErrorCode);
Some new API will almost certainly be required. A new ucnv_open function will be needed so that a codepage can be opened based upon an alias and a standard. It could look something like this:
/**
 * Creates a UConverter object with the names specified as a C string
 * based on a specified standard or platform.
 * The actual name will be resolved with the alias file
 * using a case-insensitive string comparison that ignores
 * the delimiters '-', '_', and ' ' (dash, underscore, and space).
 * E.g., the names "UTF8", "utf-8", and "Utf 8" are all equivalent.
 * If NULL is passed for the converter name, it will create
 * one with the getDefaultName return value.
 *
 * A converter name for ICU 1.5 and above may contain options
 * like a locale specification to control the specific behavior of
 * the newly instantiated converter.
 * The meaning of the options depends on the particular converter.
 * If an option is not defined for or recognized by a given converter,
 * then it is ignored.
 *
 * Options are appended to the converter name string, with a
 * UCNV_OPTION_SEP_CHAR between the name and the first option and
 * also between adjacent options.
 *
 * When the standard is NULL it will open a converter
 * that is most appropriate for the current platform. When a standard is
 * specified, it will open a converter that is most appropriate for that
 * standard.
 *
 * @param converterName name of the uconv table, may have options appended
 * @param standard the specific converter behavior to use, which is specified by
 *        the alias table.
 * @param err outgoing error status U_MEMORY_ALLOCATION_ERROR, U_FILE_ACCESS_ERROR
 * @return the created Unicode converter object, or NULL if an error occurred
 * @see ucnv_openU
 * @see ucnv_openCCSID
 * @see ucnv_close
 * @see convrtrs.txt
 * @stable
 */
U_CAPI UConverter* U_EXPORT2
ucnv_openStandard(const char *converterName, const char *standard, UErrorCode * err);
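The name comparison described in the comment above (case-insensitive, ignoring dash, underscore, and space) can be sketched in portable C as follows. This is an illustration of the stated rule only, not ICU's actual implementation; the function names is_delimiter and names_match are hypothetical.

```c
#include <ctype.h>

/* The three delimiters that the comparison skips. */
static int is_delimiter(char c) {
    return c == '-' || c == '_' || c == ' ';
}

/*
 * Compares two converter names, ignoring ASCII case and the
 * delimiters '-', '_', and ' '. Returns 1 if they match, else 0.
 */
int names_match(const char *a, const char *b) {
    for (;;) {
        /* Skip any run of delimiters on either side. */
        while (is_delimiter(*a)) {
            ++a;
        }
        while (is_delimiter(*b)) {
            ++b;
        }
        if (*a == '\0' || *b == '\0') {
            return *a == *b;  /* match only if both names are exhausted */
        }
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        ++a;
        ++b;
    }
}
```

Under this rule "UTF8", "utf-8", and "Utf 8" all compare as equal, which matches the example in the comment.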
The ucnv_open() function will need to change its behavior a bit. It can open an ICU preferred converter, or it can open a platform-preferred converter. It would probably be better if the converter that is opened remained consistent across all platforms, like it is now.
While the existing API does address our basic converter needs, some new API could be added for convenience. The same functionality can already be built from the existing API, and the new functions may not be very fast, but they are meant only as a convenience. This possible new API mirrors the ucnv_*Alias() functions, but it should use the new UEnumeration API.
/**
 * Return a new UEnumeration object for enumerating all the
 * alias names for a given converter that are recognized by a standard.
 * This method only enumerates the listed entries in the alias file.
 * The convrtrs.txt file can be modified to change the results of
 * this function.
 * The first result in this list is the same result given by
 * ucnv_getStandardName, which is the default alias for
 * the specified standard name. The returned object must be closed with
 * uenum_close when you are done with the object.
 *
 * @param convName original converter name
 * @param standard name of the standard governing the names; MIME and IANA
 *        are such standards
 * @param pErrorCode The error code
 * @return A UEnumeration object for getting all aliases that are recognized
 *         by a standard. If any of the parameters are invalid, NULL
 *         is returned.
 * @see ucnv_getStandardName
 * @see uenum_close
 * @see uenum_next
 * @draft ICU 2.2
 */
U_CAPI UEnumeration *
ucnv_openStandardNames(const char *convName, const char *standard, UErrorCode *pErrorCode);
It should be easy to implement this new design without significant API changes. Implementing this design will require a major overhaul of the underlying data structure, which will take some time.