ICU Modularization
2003.07.24, MD
This documents describes at a high level the process we are following in
making ICU more modularized. This is a working document; we will not attempt to
make it pretty!
Note that in many cases, there are conflicting user requests. For example,
some users would like to see many small DLLs, each for a small piece of the ICU
code, while others are in fact building all of ICU into a single DLL or link it
all directly into their application. (This is also not to mention requests for
more charset support out of the box combined with much smaller data...) Our job
is to try to mediate among the different choices, providing reasonable
flexibility with minimal overhead.
Note: Anything we do will have some cost, so we also have to weigh
the effort involved. Shaving 50KB might not be worth a month of programmer
time, for example.
Background
Many modularization features are already built into ICU:
- All platform-specific code is isolated in a small number of files, for
ease of porting and maintenance.
- Most of ICU has always been data-driven, with rules and patterns. Most of
the data has always been loaded from files.
- We have reduced the startup time by changing our data model from loading
and parsing source files to binary, memory-mappable files. Startup time was
very important to our users. This has cost some flexibility in terms of
endianness and charset family, but we are in the process of getting that
back.
- ICU can be modularized in many ways at build-time, using compile flags to
exclude code and/or data.
- We have replaced some build-time options with runtime choices. For
example, with ICU 1.4 a user had to choose at build time whether to load a
DLL data package or a .dat package. A later version did this automatically,
finally by always loading the DLL statically and falling back to the .dat
file if the DLL is the empty stub file.
- We have moved data from source code files into loadable data files; for
example, Unicode properties and normalization. This made the code a lot
smaller and allows post-installation Unicode updates. If someone does not
need normalization, then the data file can now be cut.
- We allowed post-installation overrides of data by loading individual files
before searching for them in a package. This can be turned off, to allow for
better performance.
- Applications can load conversion tables from their own data package.
- At runtime, programs can register different services ("service
registration), thus altering the service (such as collation) that is fetched
for a given locale. This can thus be used for customization. It can also be
used to register specialized plug-in engines, such as a Japanese Word-Break
engine.
General Issues
In the items below, if we are already planning for the implementation of a
feature in a given release, the release number will appear at the end of the
bullet, in parentheses.
- Runtime customization (as opposed to build/compile-time customization).
- ICU was designed to be very flexible, and allow for a great deal of
customization. In the past, much of this has been by allowing compile
options. We are moving towards a runtime customization model, examining
the places where we expect people to customize via source changes, and
prioritizing those for incorporation into runtime customizations. For C,
our focus will be on two scenarios:
- Source Customer: product will compile ICU with its own
customizations, compiler flags, etc. This version of ICU will not be
shared with other products.
- Binary Customer: product will use stock ICU; any customizations
will be supplied either at runtime or via install. Any install
options will be choosing which parts of ICU to install, so that
products that need other pieces can just install them.
- ICU4C: We are actively working on runtime (initialization-time)
memory/mutex service overrides, in the form of a vector of pointers that
can be passed in. To this, we will probably need to add file-system
vectors, so that they can also be customized at runtime for the
environment. (ICU2.8)
- Note: currently our service registration is not persistent.
What we plan to do is to provide serialization APIs on a
service-by-service basis, so that programs can save the current state of
registrations, and restore them on next launch. We've discussed
incorporating persistent registration directly, but we have to run in
many odd environments, so feel it is better for the host program to
handle the mechanics of saving/restoring the data; we will just provide
the tools to do that. (ICU 3.2?)
- Note: we are moving to a superset of RFC 3066 IDs for our
locale IDs (which are really language IDs). As a part of that, we are
adding script tags, and keywords instead of variants. See http://www.openi18n.org/specs/ldml/1.0/ldml-spec.htm#Locale
for more information.
- Basic Modularization. We should consider how ICU can be divided up
into smaller pieces that can be separately installed. This involves
decoupling pieces of code from one another, and decoupling data. The data
may need to be decoupled both on the basis of service and on the basis of
locale. Some of this is underway, while much remains to be done. See
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/icu_footprint.html
for figures.
- Data Modularization (J&C)
The biggest bang for the buck by far is in data modularization, since
that is where the bulk of the storage is (8,373KB in ICU4C):
- The plan in ICU 2.8 is to move service data into separate
resource-bundle trees. This is for three reasons: (a)
modularization, (b) validity testing, and (c) non-localizability. As
for (b), under the ICU locale model, data for a service is assumed
to be valid if there is a resource bundle for a locale for that
service, even if the data is inherited from a parent. For some
services we don't have complete data, so we will reflect that by
having a separate tree. As for (c), there are some resources in root
that cannot validly be overridden in locales, so they need to be
moved out. (ICU 2.8)
- Collation. This is for size.
- RBNF. This is because of incomplete data
- TimeZone Names. This is because of incomplete data
- Base Currency Information. This is non-localizable information
(such as how many decimals a currency uses). Currently some of
this is in root, but is never accessed by locale ID.
- The resource trees can also be further divided, e.g. most
collation tailorings in the base file, and the large East Asian
tailorings into separately installable files.
- We will be adding the capability to incrementally add (by
installation) charset converters and charset alias information. One
can do the former now, but it is not as easy as it should be, and
one can't add alias information. (ICU 3.0)
- ICU4J code modularization.
This was already delivered in Version 2.6, although we may refine it
further.
- ICU4C Code Modularization.
- We are considering splitting up the i18n library (688KB) into 4
pieces: collation, transforms, formatting, regex. (ICU 3.0?)
- We are considering splitting the common library (586KB) into
legacy charset conversion and the rest (base charset conversion +
non-conversion). The base charset conversion would be UTFs, Latin-1,
ASCII; the legacy would include all others (SJIS, ISO-2022, GB
18030, SCSU, etc.). (ICU 3.0?)
- Other possibilities for splitting are UnicodeSet and
BreakIterator, but we don't know if these are worth it.
- Because there is no good cross-platform way to dynamically install
C code, what we plan to do is to build both the full module and a
stub module. At install time, one can choose to install either one
(if the full module is already installed, it should remain there,
since some other installation has added it!) The stub APIs will
simply return errors if called; or provide some kind of very
simple fallback behavior. (ICU 3.0?)
- We have investigated whether we can overlap certain large CJK
charset conversion tables (without sacrificing speed), and believe
we can do so. If we do that, we can save about 1MB in the standard
release. We can also add additional conversion tables to the
standard release at a much smaller delta cost. (ICU 3.0?)
- Possibility. Some people need the ability to run, say,
collation, without having ever to build custom rules. The builders
are typically larger than the running functionality. We could supply
stub libraries that omitted the builders. These would only
support the runners, not the builders, returning an error if someone
tried to build from rules.
- Question: what granularity do clients want from us? Would
they be interested in saving one or more 100KB chunks by excluding
one or more of these pieces?
- Question: we need to find out which is more important for
clients, the distribution size (e.g. what is on the installation CD) or
the final installed size (what ends up on the end-client's machine). From
the feedback we have gotten back, it is the former that is more
important. Given that, there are certain changes we can make.
Examples:
- For speed the ICU data is in machine-endian format, with keys
using the machine invariant set (ASCII vs EBCDIC). That means that
there are 3 versions of the data files. What we will do is have a
single data file on the distribution. For installation, we
will have a tool that changes the endianness of the file on the
distribution medium to that of the machine (ICU 2.8)
- Possibility. For collation and breaks, we precompile build
rules into a binary format that can be memory-mapped. This speeds
initialization substantially. We currently distribute the binaries
for these, but we could instead supply a tool to build them on
installation instead. Or in a small disk memory-device, we can
dispense with the binaries and always build from rules (slower
startup). Note: the Collation binaries amount to 1.2MB, so this can
involve substantial savings on the distribution medium.
- Possibility. A related possibility is to have
infrequently-used data (e.g. uncommon locales) be installed in a
compressed form. We would uncompress it in memory on use, but leave
the disk copy compressed. This would cost a bit more on startup (for
that infrequent data), but save space on disk.
- Question: We need to find out whether ICU services would ever
be used on small devices (embedded/pervasive systems PDAs, Cellphones,
etc), since that could color the way we componentize it. There may be
other requirements, such as division of data and/or code into chunks of
< 64K.
- Feature Queries. Since different pieces of ICU or service data for
different locale may be installed, code -- whether ours or other's -- may
need to query the capabilities of the system. That is, which services are
installed, and for which locales?
- Service queries. We will have an API that will take an enum for
each service and test whether that service is fully installed. In C this
will test for the stub library; in Java we will use reflection to test
the presence of key classes for the service. For ICU4J we will also
build into the manifest which services are available.
- Data queries. If tailored data is not available, we guarantee
that there is some reasonable fallback behavior (e.g. UCA for
collation).
- Available Locales / Charsets. To save having to query the file
system (which may be tricky to do in a clean cross-platform way), we put
the available locales (for a given resource tree), charsets, etc into
index files (res_index.res for each resource trees and cnvalias.icu for
charset aliases). Any install process must update those index files.
This needs to be well documented, and tools need to be supplied for
installation.
- Each charsets index file may contain more than the actual
installed items. It just has to contain a superset of them. At
runtime, we will determine the actual installed items by seeing
which ones in the list are available.
- For resource trees, on the other hand, we assume that the list is
valid (this may change in the future).
- In the future, we may have a function pointer for querying the
file system, which would allow us to do away with the index files.
This function pointer would behave like mutex, tracing, etc; it
would point to platform-specific code, but be overridable at
runtime.
- Binary Compatibility. A near-term goal is to have binary
compatibility across main versions (e.g. 2.6 to 2.8). We are have a process
in place for compatibility, but need some additional work for full binary
compatibility.
- Currently, once a public API has graduated from draft status,
we only allow removal if it is dangerously broken (e.g. with unfixable
memory ownership issues). Even then, we mark them as deprecated, and
keep them for at least two releases. See ICU
Development Overview. Binary compatibility will imply keeping them
forever (but returning an error if called).
- Binary compatibility means C and Java, not C++.
- C++ is very badly designed for binary compatibility, so it
would be extremely painful to try for that. Not only is the v-table
architecture not well suited for binary compatibility, even on the
same platform different compilers are not compatible. So we will not
even attempt it.
- Note: In C we provide a 'rename' feature that allows
applications to link against multiple versions of ICU simultaneously
(e.g. for supporting both ICU 2.4 and 2.8 collation on different
databases). This must be turned off if people want binary
compatibility.
- We need a simple way for applications to test whether they are using draft,
internal, or deprecated API, so that they can be warned
that it will cause them problems. Here is how we will do that:
- For ICU4C, we will surround all internal (non-public), draft,
deprecated, and obsolete APIs with
#ifndef U_STABLE_ONLY.
That way people will be able to compile their code with a flag set
on, and test whether they are using any APIs that are not guaranteed
to maintain binary compatibility. (ICU2.8)
- We need to come up with a similar mechanism for Java. The Java
compiler does have the deprecated flag, but has no notion of @draft
or @internal.
- Note: it is possible to have the equivalent of
"friend" classes in Java (see http://www.macchiato.com/
"Equality among Friends"), although it is kind of a
pain to do. However, if we wanted to restrict access formally,
we could replace @internal and the impl directory with this
mechanism.
- For ICU4C, our tests are a mixture of C and C++ ; we will need to
flesh out the C tests for checking binary compatibility, and separate
the tests of public, stable API from the others. For each new release,
we will need to run the old tests against the new code (with renaming
off), to verify binary compatibility. For example, we will run the ICU
2.6 tests against the ICU 2.8 code. For ICU 2.8, we will start
separating the tests; we expect to complete this work in ICU 3.0.
- There are a few cases, such as Regex, where we do not have C API.
We'll have to address this so that people can have binary compatibility.
- ICU Remote Services. This is a much more speculative area. We could
have ICU services (e.g. charset conversion, collation, etc) be served over
the internet. Examples:
- A product has a minimal install of ICU. If it needs to convert an
unusual character set, it downloads the data for that charset from the
internet and uses it.
- A product has no install of ICU. If it needs to convert an unusual
charset, it sends the bytes to be converted to an ICU server, which
sends back the result.