ICU Modularization

2003.07.24, MD

This documents describes at a high level the process we are following in making ICU more modularized. This is a working document; we will not attempt to make it pretty!

Note that in many cases, there are conflicting user requests. For example, some users would like to see many small DLLs, each for a small piece of the ICU code, while others are in fact building all of ICU into a single DLL or link it all directly into their application. (This is also not to mention requests for more charset support out of the box combined with much smaller data...) Our job is to try to mediate among the different choices, providing reasonable flexibility with minimal overhead.

Note: Anything we do will have some cost, so we also have to weigh the effort involved. Shaving 50KB might not be worth a month of programmer time, for example.

Background

Many modularization features are already built into ICU:

General Issues

In the items below, if we are already planning for the implementation of a feature in a given release, the release number will appear at the end of the bullet, in parentheses.

  1. Runtime customization (as opposed to build/compile-time customization).
    1. ICU was designed to be very flexible, and allow for a great deal of customization. In the past, much of this has been by allowing compile options. We are moving towards a runtime customization model, examining the places where we expect people to customize via source changes, and prioritizing those for incorporation into runtime customizations. For C, our focus will be on two scenarios:
      1. Source Customer: product will compile ICU with its own customizations, compiler flags, etc. This version of ICU will not be shared with other products.
      2. Binary Customer: product will use stock ICU; any customizations will be supplied either at runtime or via install. Any install options will be choosing which parts of ICU to install, so that products that need other pieces can just install them.
    2. ICU4C: We are actively working on runtime (initialization-time) memory/mutex service overrides, in the form of a vector of pointers that can be passed in. To this, we will probably need to add file-system vectors, so that they can also be customized at runtime for the environment. (ICU2.8)
    3. Note: currently our service registration is not persistent. What we plan to do is to provide serialization APIs on a service-by-service basis, so that programs can save the current state of registrations, and restore them on next launch. We've discussed incorporating persistent registration directly, but we have to run in many odd environments, so feel it is better for the host program to handle the mechanics of saving/restoring the data; we will just provide the tools to do that. (ICU 3.2?)
    4. Note: we are moving to a superset of RFC 3066 IDs for our locale IDs (which are really language IDs). As a part of that, we are adding script tags, and keywords instead of variants. See http://www.openi18n.org/specs/ldml/1.0/ldml-spec.htm#Locale for more information.
  2. Basic Modularization. We should consider how ICU can be divided up into smaller pieces that can be separately installed. This involves decoupling pieces of code from one another, and decoupling data. The data may need to be decoupled both on the basis of service and on the basis of locale. Some of this is underway, while much remains to be done. See http://source.icu-project.org/repos/icu/icuhtml/trunk/design/icu_footprint.html for figures.
    1. Data Modularization (J&C)

      The biggest bang for the buck by far is in data modularization, since that is where the bulk of the storage is (8,373KB in ICU4C):

      1. The plan in ICU 2.8 is to move service data into separate resource-bundle trees. This is for three reasons: (a) modularization, (b) validity testing, and (c) non-localizability. As for (b), under the ICU locale model, data for a service is assumed to be valid if there is a resource bundle for a locale for that service, even if the data is inherited from a parent. For some services we don't have complete data, so we will reflect that by having a separate tree. As for (c), there are some resources in root that cannot validly be overridden in locales, so they need to be moved out. (ICU 2.8)
        1. Collation. This is for size.
        2. RBNF. This is because of incomplete data
        3. TimeZone Names. This is because of incomplete data
        4. Base Currency Information. This is non-localizable information (such as how many decimals a currency uses). Currently some of this is in root, but is never accessed by locale ID.
      2. The resource trees can also be further divided, e.g. most collation tailorings in the base file, and the large East Asian tailorings into separately installable files.
      3. We will be adding the capability to incrementally add (by installation) charset converters and charset alias information. One can do the former now, but it is not as easy as it should be, and one can't add alias information. (ICU 3.0)
    2. ICU4J code modularization.

      This was already delivered in Version 2.6, although we may refine it further.

    3. ICU4C Code Modularization.
      1. We are considering splitting up the i18n library (688KB) into 4 pieces: collation, transforms, formatting, regex. (ICU 3.0?)
      2. We are considering splitting the common library (586KB) into legacy charset conversion and the rest (base charset conversion + non-conversion). The base charset conversion would be UTFs, Latin-1, ASCII; the legacy would include all others (SJIS, ISO-2022, GB 18030, SCSU, etc.). (ICU 3.0?)
        • Other possibilities for splitting are UnicodeSet and BreakIterator, but we don't know if these are worth it.
      3. Because there is no good cross-platform way to dynamically install C code, what we plan to do is to build both the full module and a stub module. At install time, one can choose to install either one (if the full module is already installed, it should remain there, since some other installation has added it!) The stub APIs will simply return errors if called; or provide some kind of very simple fallback behavior. (ICU 3.0?)
      4. We have investigated whether we can overlap certain large CJK charset conversion tables (without sacrificing speed), and believe we can do so. If we do that, we can save about 1MB in the standard release. We can also add additional conversion tables to the standard release at a much smaller delta cost. (ICU 3.0?)
      5. Possibility. Some people need the ability to run, say, collation, without having ever to build custom rules. The builders are typically larger than the running functionality. We could supply stub libraries that omitted the builders. These would only support the runners, not the builders, returning an error if someone tried to build from rules.
      6. Question: what granularity do clients want from us? Would they be interested in saving one or more 100KB chunks by excluding one or more of these pieces?
    4. Question: we need to find out which is more important for clients, the distribution size (e.g. what is on the installation CD) or the final installed size (what ends up on the end-client's machine). From the feedback we have gotten back, it is the former that is more important. Given that, there are certain changes we can make. Examples:
      1. For speed the ICU data is in machine-endian format, with keys using the machine invariant set (ASCII vs EBCDIC). That means that there are 3 versions of the data files. What we will do is have a single data file on the distribution. For  installation, we will have a tool that changes the endianness of the file on the distribution medium to that of the machine (ICU 2.8)
      2. Possibility. For collation and breaks, we precompile build rules into a binary format that can be memory-mapped. This speeds initialization substantially. We currently distribute the binaries for these, but we could instead supply a tool to build them on installation instead. Or in a small disk memory-device, we can dispense with the binaries and always build from rules (slower startup). Note: the Collation binaries amount to 1.2MB, so this can involve substantial savings on the distribution medium.
      3. Possibility. A related possibility is to have infrequently-used data (e.g. uncommon locales) be installed in a compressed form. We would uncompress it in memory on use, but leave the disk copy compressed. This would cost a bit more on startup (for that infrequent data), but save space on disk.
    5. Question: We need to find out whether ICU services would ever be used on small devices (embedded/pervasive systems PDAs, Cellphones, etc), since that could color the way we componentize it. There may be other requirements, such as division of data and/or code into chunks of < 64K.
  3. Feature Queries. Since different pieces of ICU or service data for different locale may be installed, code -- whether ours or other's -- may need to query the capabilities of the system. That is, which services are installed, and for which locales?
    1. Service queries. We will have an API that will take an enum for each service and test whether that service is fully installed. In C this will test for the stub library; in Java we will use reflection to test the presence of key classes for the service. For ICU4J we will also build into the manifest which services are available.
    2. Data queries. If tailored data is not available, we guarantee that there is some reasonable fallback behavior (e.g. UCA for collation).
    3. Available Locales / Charsets. To save having to query the file system (which may be tricky to do in a clean cross-platform way), we put the available locales (for a given resource tree), charsets, etc into index files (res_index.res for each resource trees and cnvalias.icu for charset aliases). Any install process must update those index files. This needs to be well documented, and tools need to be supplied for installation.
      • Each charsets index file may contain more than the actual installed items. It just has to contain a superset of them. At runtime, we will determine the actual installed items by seeing which ones in the list are available.
      • For resource trees, on the other hand, we assume that the list is valid (this may change in the future).
      • In the future, we may have a function pointer for querying the file system, which would allow us to do away with the index files. This function pointer would behave like mutex, tracing, etc; it would point to platform-specific code, but be overridable at runtime.
  4. Binary Compatibility. A near-term goal is to have binary compatibility across main versions (e.g. 2.6 to 2.8). We are have a process in place for compatibility, but need some additional work for full binary compatibility.
    1. Currently, once a public API has graduated from draft status, we only allow removal if it is dangerously broken (e.g. with unfixable memory ownership issues). Even then, we mark them as deprecated, and keep them for at least two releases. See ICU Development Overview. Binary compatibility will imply keeping them forever (but returning an error if called).
    2. Binary compatibility means C and Java, not C++.
      • C++ is very badly designed for binary compatibility, so it would be extremely painful to try for that. Not only is the v-table architecture not well suited for binary compatibility, even on the same platform different compilers are not compatible. So we will not even attempt it.
      • Note: In C we provide a 'rename' feature that allows applications to link against multiple versions of ICU simultaneously (e.g. for supporting both ICU 2.4 and 2.8 collation on different databases). This must be turned off if people want binary compatibility.
    3. We need a simple way for applications to test whether they are using draft, internal, or deprecated API, so that they can be warned that it will cause them problems. Here is how we will do that:
      1. For ICU4C, we will surround all internal (non-public), draft, deprecated, and obsolete APIs with #ifndef U_STABLE_ONLY. That way people will be able to compile their code with a flag set on, and test whether they are using any APIs that are not guaranteed to maintain binary compatibility. (ICU2.8)
      2. We need to come up with a similar mechanism for Java. The Java compiler does have the deprecated flag, but has no notion of @draft or @internal.
        • Note: it is possible to have the equivalent of "friend" classes in Java (see http://www.macchiato.com/ "Equality among Friends"), although it is kind of a pain to do. However, if we wanted to restrict access formally, we could replace @internal and the impl directory with this mechanism.
    4. For ICU4C, our tests are a mixture of C and C++ ; we will need to flesh out the C tests for checking binary compatibility, and separate the tests of public, stable API from the others. For each new release, we will need to run the old tests against the new code (with renaming off), to verify binary compatibility. For example, we will run the ICU 2.6 tests against the ICU 2.8 code. For ICU 2.8, we will start separating the tests; we expect to complete this work in ICU 3.0.
    5. There are a few cases, such as Regex, where we do not have C API. We'll have to address this so that people can have binary compatibility.
  5. ICU Remote Services. This is a much more speculative area. We could have ICU services (e.g. charset conversion, collation, etc) be served over the internet. Examples:
    1. A product has a minimal install of ICU. If it needs to convert an unusual character set, it downloads the data for that charset from the internet and uses it.
    2. A product has no install of ICU. If it needs to convert an unusual charset, it sends the bytes to be converted to an ICU server, which sends back the result.