Summary of discussions 2005-aug..sep
Design meeting 2005-oct-12 (Markus Scherer, George Rhoten, Raghuram Viswanadha, Steven R Loomis, Vladimir Weinstein)
Development so far: The icupkg tool can now automatically discover dependencies between .cnv conversion tables (from delta-only to base tables) and between .res resource bundles (from aliases).
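As a rough illustration of what automatic dependency discovery makes possible: once each item's direct dependencies are known (a delta-only .cnv table on its base table, a .res bundle on its alias targets), a tool can compute everything that has to ship with a chosen set of items. The item names and the direct_deps map below are hypothetical, and this Python model is only a sketch, not icupkg's actual code.

```python
from collections import deque

# Hypothetical direct dependencies, standing in for what a tool would
# discover by reading the item files themselves.
direct_deps = {
    "delta_example.cnv": ["base_example.cnv"],  # delta-only table -> base table
    "aa_XX.res": ["bb.res"],                    # bundle containing an alias
    "bb.res": [],
    "base_example.cnv": [],
}

def closure(wanted, deps):
    """Return the wanted items plus everything they depend on, transitively."""
    result = set()
    queue = deque(wanted)
    while queue:
        item = queue.popleft()
        if item in result:
            continue  # already handled; also protects against cycles
        result.add(item)
        queue.extend(deps.get(item, []))
    return result

needed = closure(["delta_example.cnv", "aa_XX.res"], direct_deps)
print(sorted(needed))
```

The breadth-first walk keeps a visited set, so alias cycles between bundles would not loop forever.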
The dependencies between a resource bundle and its parent (according to its locale fallback) will be handled by the following:
The bundle source syntax will be: table(nofallback) {...}
ures_open() will treat a nofallback bundle as if ures_openDirect() had been called to open it. ures_openDirect() will still behave slightly better for such files, because if the bundle is missing altogether, it does not attempt to open a root bundle, unlike ures_open().
If the flag is missing completely, it is treated as using fallbacks.
Alternatively, we could have a single per-tree flag in the meta data for whether all bundles participate in locale fallback. However, this would mean that users would be forced to use data trees (which we think are rarely used and poorly understood) to avoid mixing locale bundles and non-locale bundles at the top level.
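The intended difference between the two open calls for a nofallback bundle can be modeled roughly as follows. This is a simplified Python model, not ICU code: the bundles dictionary, the function names, and the 'nofallback' key are hypothetical, and the model ignores details of real ICU fallback such as the default-locale step.

```python
def parent_locale(name):
    """Truncate at the last underscore; a name without one falls back to root."""
    if name == "root":
        return None
    if "_" in name:
        return name.rsplit("_", 1)[0]
    return "root"

def open_direct(name, bundles):
    """Model of ures_openDirect(): only the named bundle, no root attempt."""
    return [name] if name in bundles else []

def open_with_fallback(name, bundles):
    """Model of ures_open(): list the bundles that would be consulted, in order."""
    current = name
    # Find the first bundle along the chain that actually exists.
    while current is not None and current not in bundles:
        current = parent_locale(current)
    if current is None:
        return []
    # A bundle flagged nofallback is treated as if opened directly.
    # A missing flag means the bundle uses fallback.
    if bundles[current].get("nofallback", False):
        return [current]
    chain = []
    while current is not None:
        if current in bundles:
            chain.append(current)
        current = parent_locale(current)
    return chain

bundles = {
    "root": {},
    "de": {},
    "de_DE": {},
    "supplementalData": {"nofallback": True},
}
print(open_with_fallback("de_DE", bundles))
print(open_with_fallback("supplementalData", bundles))
print(open_with_fallback("missingBundle", bundles))
print(open_direct("missingBundle", bundles))
```

The last two calls show the remaining difference: for a bundle that is missing altogether, the fallback-style open still lands on root, while the direct open finds nothing.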
Consider moving locale bundles to a locales/ data tree (or non-locale bundles to a misc/ tree) for better separation between data items of different types and usage.
icupkg needs to support tree names in .res aliases: /ICUDATA-coll/...
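A minimal sketch of how a tool might turn such an alias into the package item it depends on. The mapping from tree name to an item name of the form tree/bundle.res is an assumption made for illustration, the function name is hypothetical, and the example alias paths are invented.

```python
def alias_to_item(alias):
    """Map an alias like '/ICUDATA-coll/de' to an item name, or None.

    Assumption: '/ICUDATA/xx/...' refers to 'xx.res' in the top-level tree
    and '/ICUDATA-tree/xx/...' refers to 'tree/xx.res'.
    """
    prefix = "/ICUDATA"
    if not alias.startswith(prefix):
        return None  # not a cross-bundle alias in this simplified model
    rest = alias[len(prefix):]
    head, sep, tail = rest.partition("/")
    if not sep or not tail:
        return None
    if head == "":
        tree = ""                 # no tree name: top-level data
    elif head.startswith("-") and len(head) > 1:
        tree = head[1:]           # '-coll' -> 'coll'
    else:
        return None               # e.g. '/ICUDATAX/...'
    bundle = tail.split("/", 1)[0]
    if not bundle:
        return None
    name = bundle + ".res"
    return tree + "/" + name if tree else name

print(alias_to_item("/ICUDATA-coll/de"))
print(alias_to_item("/ICUDATA/ja/calendar"))
```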
With the above, icupkg will handle the majority of item dependencies. We decided that for now we will not handle other dependencies via meta data. For issues and details on future work, see Jitterbug 4877, RFE: ICU data packaging meta data: support explicit item dependencies on package level.
Decisions for meta data work for ICU 3.6:
ICU4J:
Examples for meta data usage:
Markus Scherer, 20050824..26
Several ways to build/load/ship ICU and ICU-application data.
Hard to change data: Updating a .dat package means extracting all items and a manifest file via decmn, modifying the files and the manifest in the file system, swapping new items to the common type, repackaging with gencmn, and maybe swapping again to a new type. (All of this can be done with icupkg [below] with a single command-line invocation.) We also use index files for getAvailable() functions, and one would have to manually update these index files, at least for ICU4C.
Alternatively, one starts from a data-source download (now available only through CVS, not as .zip/.tgz), modifies files, and then packages them. Some of the index files are generated by the build.
This is about packaging and indexing. It is not about changing the format of .res files or any other ICU data files. It is also not about how ICU4J data is packaged in general (putting individual data files into the .jar).
However, as a result of the discussion, index files (at least in ICU4C) may be switched from being .res files to using the existing data format of the converter alias table. Such index files would probably use only a subset of that format's capabilities.
There is a new command-line tool, icupkg. It can create .dat files, swap them, add/remove/extract and list items, change the copyright/comment, read package and item files in any platform type, and write the .dat package and extracted files in the desired platform type. It does not yet handle indexes in any way. It could replace gencmn, decmn, and icuswap completely.
The icupkg tool is not finished. It needs testing/debugging, and if we decide to use automatically updated index files, such code will need to be written and integrated. Its package-handling code (mostly a C++ class) has been moved (aug26) to the toolutil library so that we can easily reuse this code elsewhere, via Java JNI, etc. icupkg has been in CVS since 20050826.
What do we want to achieve? Thoughts:
Not everything here is easy, nor handled nearly completely with the above proposal.
(Emails 20050824)
Here are my thoughts from an ICU4J perspective.
My main concern is to keep it easy for Java users to build and customize ICU4J with a minimum of tooling. Currently, all you need is Ant, JDK 1.4, and the ICU4J source jar. ICU4J uses locale-specific .res files (in the main) so that removing particular resources is 'easy' (disregarding problems that arise from removing some resources that other resources alias, or messing up inheritance). Adding resources to our resource tree is not really possible, but people can build their own resource trees using standard Java resources. I suspect the main demand is to reduce footprint by removing unused resources.
Requiring an ICU4C tool and accompanying ICU4C installation is a significant addition to the toolset. I'd like to keep the common case where ICU4J users just want to trim data in 'chunks'-- e.g. remove a bunch of locales, or all data related to an unused service-- as simple as it is currently.
Changes to the internal structure of the data are simply a porting issue-- ICU4J needs to know how to read the .res files. If ICU4J retains the per-locale .res file format, ICU4J is pretty much untouched: it already builds its indexes at build time by looking at what resources are present during the build. They are simply a list of strings naming the locales.
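The build-time index described here can be sketched as follows. This is a Python model rather than the actual ICU4J build step; the function name and the index file name res_index are assumptions for the example, and the input is assumed to be a file listing of a directory that holds only locale bundles.

```python
def build_locale_index(filenames, index_name="res_index"):
    """Return the sorted locale names for the .res files that are present."""
    names = []
    for filename in filenames:
        if not filename.endswith(".res"):
            continue  # ignore anything that is not a resource bundle
        base = filename[:-len(".res")]
        if base == index_name:
            continue  # the index does not list itself
        names.append(base)
    return sorted(names)

present = ["de.res", "root.res", "res_index.res", "de_DE.res", "readme.txt"]
print(build_locale_index(present))
```

Removing a locale's .res file before the build is then enough to drop it from the index, which matches the "trim data in chunks" use case.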
More significant customization would be done (I presume) by more sophisticated users who would be willing to install ICU4C and learn (at least) to use the tool. If ICU4J retains the per-locale res file model, then ideally these users could download the ICU .dat files and then use the tool to create the per-locale .res files (I don't know the tool's capabilities) without having to know how to build ICU and run other tools to do this. One-stop shopping is a plus here.
A Java tool that provided a GUI, and used ICU4JNI or called the ICU4C tool directly, would be a plus.
Doug, your concern is perfectly placed. Let me clarify.
The proposal does not change at all how ICU4J stores and loads .res files. If someone works from a .jar file or from its extracted contents, then nothing on that level will change. Changing the internal structure of .res files is not on the table.
The part of the discussion that affects ICU4J is the question of whether to use index files for getAvailable() and for aliases/equivalents/validSubLocales, or whether instead to enumerate the relevant items. We should have the same solution (index vs. enumeration) for both ICU4C and ICU4J.
The real change here would be that the index file would come from ICU4C (and an ICU4C tool) instead of from the ICU4J build process.
In addition, if we decide to go with index files, we need to decide whether to keep them in resource bundle format, or whether to use the data format of the conversion alias table. (The latter's source text form is actually simpler to edit than resource bundle sources.)
For generating new .res files, ICU4J already relies on ICU4C tools. If we add tool support for updating index files, this may or may not mean another ICU4C tool for that.
Another possible impact of the discussion on ICU4J: Ram has written a script that takes the ICU4C data, generates the Unicode properties files, and creates the ICU4J data .jar file. If the input to this script is a .dat package, then updating the .dat package with tools would also be a way to customize ICU4J data. The index files would be updated automatically before the script gets the .dat file.
Yes, the new icupkg tool (and the old decmn tool) can take a .dat file and extract all of the .res and other items in it as separate files. (Much like unzip.)