ICU has one "tree" for its locale-based data (of course, applications can have other trees for their own locale-based data). That is, each locale resource bundle has a name consisting of only the locale ID, so that we cannot distinguish between a "de" bundle for collation and a "de" bundle for RBNF (RuleBasedNumberFormat), for example.
Our data loading APIs (udata.h) and resource bundle APIs (ures.h) and similar take essentially two parameters for the naming, with a third mostly implied:
This design does not allow us to modularize the use of ICU, that is, to easily cut out certain types of locale-based data. (For more about modularization, see http://source.icu-project.org/repos/icu/icuhtml/trunk/design/modularization/icu_modularization_overview.html) It is also not easy to support per-service enumerations of supported locale IDs.
We leave the APIs unchanged syntactically and parse another separator in the "basename" part of the "path/basename" parameter. We separate the "basename" into a "package name" and a "tree name" with a dash (hyphen '-'). We do this inside the generic udata_ API (as opposed to only in resource bundle code) so that we can (potentially) have multiple "trees" not just for resource bundles but also for other kinds of data (e.g., with an EBCDIC "tree" for conversion).
Nested "subtrees" are separated with further dashes.
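As a minimal sketch of the parsing described above, the following C function splits a basename at the first dash into a package name and a tree name, and turns further dashes into folder separators for nested subtrees. The function name split_basename is hypothetical, this is not the actual udata_ implementation, and the sketch ignores the "path/" part of the parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU API): splits "package-tree-subtree"
 * into the package name and a '/'-separated tree path.
 * Returns 0 on success, -1 if an output buffer is too small.
 */
static int split_basename(const char *basename,
                          char *package, size_t packageSize,
                          char *treePath, size_t treePathSize)
{
    const char *dash;
    size_t packageLen, treeLen, i;

    assert(basename != NULL && package != NULL && treePath != NULL);

    /* The first dash separates the package name from the tree name. */
    dash = strchr(basename, '-');
    packageLen = (dash != NULL) ? (size_t)(dash - basename) : strlen(basename);
    if (packageLen >= packageSize) {
        return -1;
    }
    memcpy(package, basename, packageLen);
    package[packageLen] = '\0';

    if (dash == NULL) {
        /* No tree name: the data lives directly in the package. */
        if (treePathSize == 0) {
            return -1;
        }
        treePath[0] = '\0';
        return 0;
    }

    /* Further dashes separate nested subtrees; map them to folders. */
    treeLen = strlen(dash + 1);
    if (treeLen >= treePathSize) {
        return -1;
    }
    for (i = 0; i < treeLen; ++i) {
        char c = dash[1 + i];
        treePath[i] = (c == '-') ? '/' : c;
    }
    treePath[treeLen] = '\0';
    return 0;
}
```

Under these assumptions, "icudt26l-collation" yields the package "icudt26l" and the tree path "collation", and a basename without a dash leaves the tree path empty.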
We currently prepend the package name to the filenames of binary files. In order to not use extremely long filenames, and to better show the data organization, we will replace this by organizing binary files into folders; the top folder will have the package's name, and nested subfolders will have the tree and subtree names. Similarly, source filenames will not get the tree name(s) prepended, but will be placed in separate folders. Source file folders need not be organized along the same structure though.
Using folders like this probably means that most of our data build tools (e.g., genrb) need not be modified. The makefiles specify the appropriate source and destination folders, and do not give a package name to (most of) the build tools. Dependencies of binary output files on text input files can be simplified because the file basenames now match, without package name prefixes.
For example, source/data/locales/de.txt will be split, and the collation part of it will be put into a de.txt in a new folder, e.g., in source/data/locales/collation/ or in source/data/collation/. The genrb tool will be called with the fully specified source and destination folder names but without the package name parameter, and will generate source/data/out/icudt26l/de.res for the file with the remaining localizations and source/data/out/icudt26l/collation/de.res for the collation file. Note that this replaces the current source/data/out/build/ folder with a source/data/out/icudt26l/ folder to cut one hierarchy level.
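The folder layout for binary files can be sketched as a small path builder. The function name build_res_path and its parameter order are assumptions made for illustration only; the build itself would get these folders from the makefiles rather than from code like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU API): builds
 *   <outDir>/<package>/<locale>.res          when there is no tree, or
 *   <outDir>/<package>/<treePath>/<locale>.res otherwise.
 * Returns 0 on success, -1 if dest is too small.
 */
static int build_res_path(char *dest, size_t destSize,
                          const char *outDir, const char *package,
                          const char *treePath, const char *locale)
{
    int n;

    assert(dest != NULL && outDir != NULL && package != NULL && locale != NULL);

    if (treePath == NULL || strlen(treePath) == 0) {
        /* The top folder carries the package name. */
        n = snprintf(dest, destSize, "%s/%s/%s.res", outDir, package, locale);
    } else {
        /* Nested subfolders carry the tree and subtree names. */
        n = snprintf(dest, destSize, "%s/%s/%s/%s.res",
                     outDir, package, treePath, locale);
    }
    return (n >= 0 && (size_t)n < destSize) ? 0 : -1;
}
```

With outDir "source/data/out", package "icudt26l", tree "collation" and locale "de", this produces the collation path named in the example above; with an empty tree it produces the path of the file with the remaining localizations.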
Index files, which we currently use for uloc_getAvailable(), will be generated per tree and used in the appropriate service APIs like ucol_getAvailable(). This is an improvement in the accuracy of the getAvailable() information. In the future, as a separate work item, we may introduce runtime enumeration of package entries and of filesystem files to make this more dynamic.
Currently, data loading works roughly as follows; the first location where we find the data wins (looking for a package uses a package cache before looking in the paths):
For ICU, the package name defaults to something like icudt26l where "26" is the ICU version and "l" is for little-endian data ("b" is for big-endian and "e" is for big-endian EBCDIC platforms).
With the above additional "tree name(s)", this should change to
This not only allows us to distinguish between multiple "de" files for different trees, but also to have separate .dat files, one per tree (and subtree).
For example, we could have a "collation" tree for collation, an "rbnf" tree for RuleBasedNumberFormat, a "calendar" tree for calendar data and date/time formats, and so on. The important thing is that we need at least a tree for each "service" so that we can report usable information for what locale IDs are actually implemented (ucol_getAvailable() and fallback/using-default conditions). Further subdivisions are possible if they don't get in the way of this mechanism and help with modularization.
Note:
If we wanted separate (sub)trees for subsets of conversion tables, then we would need to think about how we express them in the alias table (convrtrs.txt) so that from the lookup result we know the tree or subtree. The tree would not be part of the charset name matching, which might require some code and/or data structure adjustments. For example:
# table in the default tree
ibm-5348 aliases...
# table in the EBCDIC tree
ebcdic/ibm-1140 aliases...
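If the alias table used a "tree/name" syntax like the one in this example, the lookup code could separate the tree from the charset name before name matching. The following is only a sketch under that assumption; split_converter_entry is a hypothetical name, and the real convrtrs.txt processing may need a different data structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not an ICU API): for an entry such as
 * "ebcdic/ibm-1140", copies the tree ("ebcdic") into the tree buffer and
 * returns a pointer to the charset name ("ibm-1140") inside entry.
 * An entry without a slash belongs to the default tree (empty string).
 * Returns NULL if the tree buffer is too small.
 */
static const char *split_converter_entry(const char *entry,
                                         char *tree, size_t treeSize)
{
    const char *slash;
    size_t treeLen;

    assert(entry != NULL && tree != NULL);

    /* Use the last slash so that nested subtrees stay in the tree part. */
    slash = strrchr(entry, '/');
    treeLen = (slash != NULL) ? (size_t)(slash - entry) : 0;
    if (treeLen >= treeSize) {
        return NULL;
    }
    memcpy(tree, entry, treeLen);
    tree[treeLen] = '\0';

    /* Only the part after the tree takes part in charset name matching. */
    return (slash != NULL) ? slash + 1 : entry;
}
```

This keeps the tree out of the charset name matching, as required above: the returned name is what would be compared against aliases, and the tree only selects where the table is loaded from.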
First circulated internally on 2003-jul-08, then modified in discussion.
markus