This document describes the proposed revisions to Collation in ICU. These revisions are for full UCA compliance, much better performance, much smaller sortkeys, faster initialization, smaller memory footprint, smaller disk footprint, additional parametric control, and additional tailoring control.
The techniques applied in this implementation are useful for a broader audience than simply collation. While Unicode is far, far more efficient than dealing with a multiplicity of separate code pages, the sheer size of the character set (from 0 to 10FFFF16) means that different types of algorithms and data structures are useful. The mechanisms used in collation can be useful in many other types of processing.
However, collation itself does have many special characteristics. The goal is to have a uniform, fast mechanism for providing an ordering of strings that matches, to the extent possible, user perceptions of the "correct" order for their language. This involves features like multiple strength levels, contracting characters, and expanding characters, plus oddities such as French secondaries or Thai reordering. We will not dive into a discussion of these features; instead, readers must first read and be familiar with the following documents.
Section 5.17 Sorting and Searching from The Unicode Standard
Without being familiar with these documents, the rest of this document will make little sense.
Note: This is a working document, and often does not have the cleanest exposition. It has evolved with the development project, and documents most heavily the parts that we found to be the trickiest. See Modifications for the latest document changes.
This document is liberally sprinkled with code fragments. These are meant to be illustrative, and will not necessarily track or match the final implementation, nor even necessarily be consistent with one another! The actual code can be found in the following list.
The fundamental data element used in collation is the collation element (CE). The basic algorithm for sort keys is to fetch a collation element for each character, then extract different level fields from the CE, route those level fields to different weight lists, then concatenate the weight lists together. This is described much more detail in UTS #10: Unicode Collation Algorithm in some detail, and we won't repeat that discussion here.
ICU uses a type of collation element that differs somewhat from what is in UCA. It is called a fractional collation element, and described below. Although it differs in format from the integral collation element used in UCA, it can be used to produce the same results, but with the advantages of smaller sort-key size.
The following section was written originally in FAQ form. It has not been restructured yet to fit in with the format of the rest of the document.
Q: A collation weight in the UCA is defined to be a 16 bit value (aka wyde). This will surely fail when Unicode 3.1 rolls around, with over 90,000 characters!
A: The UCA makes provision for more than 64K weight values: see Section 6.2 Large Weight Values and also 6.3.2 Escape Hatch. This mechanism is also used in Weight Derivation, as in 7.1.2 Legal code points. It discusses using a sequence of two collation elements, of the form:
[(X1+1).0000.0000], [yyyy.zzzz.wwww]
Q: I find this hard to follow. Is there any other way to explain the issue?
Ok. We can look at the weights in a different way, which may help to clarify what is going on.
We'll define fractional collation elements to be the same as standard collation elements except that each weight is a fraction between zero and one (instead of being an integer from 1 to FFFF). For ease in working with computers, these are binary fractions, represented as hexadecimal.
Examples:
[0.000000000, 0.920000000, 0.02057A900]
[0.100000000, 0.A30000000, 0.02057A900]
[0.1202C456B, 0.78AF00000, 0.023A90000]
With fractional collation elements, it is easy to see that all Unicode code points (including the supplementary code points) could have distinct primary mappings: there are innumerably many more than 10FFFF possible fractions!
Q: Is that all there is to it?
Not quite. We still will have to turn these fractional collation elements into well-formed standard collation elements that we can use to build a sort key. To do that, we put some restrictions on the allowable values for the fractional weights. By adopting these restrictions, we make the conversion very simple, without limiting the indefinitely large range offered by fractional weights.
Consider a fractional weight as broken into a sequence of bytes of 2 hex digits each, excluding any trailing bytes that would be entirely zero, and omitting the leading "0.". (We could give a precise numeric definition, but it is easier to think of it as simply taking bytes at a time.)
Example:
0.12C456A000000... breaks into four bytes: 12 C4 56 A0
0.12C456A320000.... breaks into five bytes: 12 C4 56 A3 20
So the first example of fractional collation elements becomes:
Examples:
[, 92,
02 05 7A 90]
[10, A3,
02 05 7A 90]
[12 02 C4 56 B0, 78 AF, 02 3A 90]
Since we eventually will be showing that we could convert these fractional collation weights into standard ones, we will put some restrictions on the values taken by these fractions, based (not surprisingly) on their bytes. Since we have wide latitude in choosing the precise values for the fractional weights in setting up a collation table, these restrictions are not at all onerous.
R1. No byte can be 00, 01, or 02.
The reason for this rule is to avoid collision with the level separator and with null bytes when the fractional weight is eventually used in a sort key. The 02 is also used for Merge Comparison.
Example:
12 C4 00 50 violates R1, since the third byte is zero.
R2. A fractional weight cannot exactly match the initial bytes of another fractional weight at the same level.
The reason for this rule is to avoid having sort keys where the starting bytes from one string are compared against the trailing bytes from another.
Example:
The two primary weights A3 92 12 C4 50 and A3 92 12 C4 violate R2. If the second weight were A3 92 12 C5 or A3 92 12 C4 52, it would not violate R2.
Allowing fractions to break this rule would cause a problem when these bytes are pushed into a sort key (see next question). Let's take an example where we just concentrate on the primary weights. Suppose x[1] = A3, y[1] = A3 23, and a[1] = 49. Then we would get the following ordering:
a [49 01...] x [A3 01...] y [A3 23 01 ...] xa [A3 49 01...] Because the primary weights turn into different lengths, and they don't follow R2, we get incorrect behavior. If R2 is followed, this can never happen, since "x" and "y" would have to differ at some point before we ran out of bytes on one side.
R3. No fractional collation element can have a zero weight at Level N and a non-zero weight at Level N-1. Any collation elements that violate this rule are called ill-formed.
The reason for this rule is to avoid allowing character to transpose, and still have the same sort key (cf. UCA §4.4, Step 4: Compare).
Any fractional collation element that does not meet these restrictions is called ill-formed.
Q: Once I have a well-formed fractional collation element table, how do I generate a sort key?
A fractional collation element table can easily be transformed into a standard one. Each fractional collation element is transformed into a sequence of one or more standard collation elements:
Break each fractional weight into a sequence of bytes.
Take two bytes from each level to form a collation element.
If there is an odd number of bytes, use 02 for the second byte.
If there are no more bytes for a particular level, use zero for the that level.
If there are no more bytes for all levels, stop.
Example:
| Fraction Collation Element | UCA Collation Element | In Sort Key |
|---|---|---|
| [12 02 C4, 78, 03] | [1202.7802.0302], [.C402.0000.0000] | [12 02 C4 02 00 00 78 02 00 00 03 02] |
Using this transformation, two fractional collation elements will have the same relative ordering as their derived UCA collation element sequences. Because the fractional collation elements can handle all Unicode code points (even supplementary code points, above U+FFFF), so can the derived UCA collation elements sequences.
All but the first collation element in the derived sequence are called continuation collation elements. If you now look back at the discussions in Section 6.2 Large Weight Values, 6.3.2 Escape Hatch, and 7.1.2 Legal code points, you will see continuation collation elements that implicitly represent fractional collation elements.
Q: Aren't the continuation collation elements in the above example ill-formed?
The text of the UCA is not as clear as it should be on that point. It says:
Except in special cases, no collation element can have a zero weight at Level N and a non-zero weight at Level N-1. Any collation elements that violate this rule are called ill-formed. The reason for this will be explained under Step 4 of the main algorithm.
The "special cases" referred to in the text are precisely the continuation collation elements that would result from generating the collation element table from a fractional collation element table. A reformulation in terms of fractional collation elements clears this up.
Q: In tailoring, I need to put a collation element between two others. How can I do this without changing the original two values?
The easiest way to do this is to view the collation element table as fractional collation elements, as described in the previous questions. If you construct your original table so that you leave a bit of room between adjacent collation elements, then you can always find intermediate values for the weights at any level.
Q: What do you mean by "a bit of room"?
For two adjacent collation elements in the table, just make sure that for each level there is at least one valid, extensible fractional weight between the weights from those elements. (Extensible means that the weight could be of any length, e.g. without breaking R2.)
Example:
AB C3 and AB D0 have room:
One could insert 13 different 2-byte fractions: AB C4, AB C5, ..., AB CC; or many more 3 or more byte fractions.
AB CD and AB CF have room:
One could insert AB CE, or one could insert many more 3-byte fractions: AB CE 02, AB CE 03, ...
AB CD and AB CE don't have room.
While fractions of the form AB CD xx are between these values, they would violate R2 above.
AA FF and AB 02 don't have room.
Inserting AA FF xx or AB would violate R1.
inserting AB 00 or AB 01 would violate R2.
Q: So how do I determine the intermediate values?
First, determine how many weights you need, and then how many valid weights are between the two given weights. Unless the fractional weights have the same number of bytes and only differ in the last byte, there will usually be far more weights than you need. If you know more about the relative frequency of the characters in text, you can choose shorter weights for the more frequent weights.
More precisely, see Intermediate CEs.
Q: I use the mechanisms in UCA §6.1.2, L2/L3 in 8 bits to reduce my sort-key to less than half the size. How can I use this continuation/tailoring method with bytes instead of wydes?
In the above, instead of adding 02 to odd-byte-length weights, leave the bytes zero. When composing the sort key, just omit any zero bytes when composing the sort key. Use 01 for the LEVEL_SEPARATOR. Thus the above example would become the considerably shorter key below:
| Fraction Collation Element | UCA Collation Element | In Sort Key | |
|---|---|---|---|
| Old | [12 02 C4, 78, 03] | [1202.7802.0302], [.C402.0000.0000] | [12 02 C4 02 00 00 78 02 00 00 03 02] |
| New | [12 02 C4, 78, 03] | [1202.7800.0300], [.C400.0000.0000] | [12 02 C4 01 78 01 03] |
ICU uses a single 32-bit value in the code and tables to represent a fractional collation element (FCE: see Fractional Collation Elements) into 32-bit chunks, in a way that can be easily used in generating sort keys. Since sometimes 32 bits is not enough, and sometimes exceptional processing must be handled, there are different forms of CE that distinguished by whether the first nybble is F or not.
The normal CE is of the following form. P is primary, S is secondary, C is case/continuation, and T is tertiary.
| P | P | P | P | P | P | P | P | P | P | P | P | P | P | P | P | S | S | S | S | S | S | S | S | C | C | T | T | T | T | T | T |
| 16b | 8b | 2b | 6b | ||||||||||||||||||||||||||||
The first nybble of the primary can never be F; this constraint is maintained by the data builder. The Case/Continuation value are used for two purposes: as a case value, or to indicate a continuation. When used for case, it can either be used as part of the case level, or considered part of the tertiary weight. In either case, a parameter can be used to invert it, thus changing whether small is before large or the reverse. That parameter can be either set in the rules or by a set call.
| CC | Group | Description |
|---|---|---|
| 00 | Small | lowercase letters, uncased letters, and small kana |
| 01 | Mixed | a mixture of small and large. Does not occur in the UCA, but may occur in contractions, such as "Ch" in Slovak. |
| 10 | Large | uppercase letters and large kana |
| 11 | Continuation | Continuation CEs can only occur in Expansions (although not all Expansion CEs will be Continuations). A continuation is used when the entire fraction for any of the weight levels cannot fit into one CE. |
Note: to meet validity constraints, a tertiary can only be zero (IGNORE) if the primary and secondary are zero; the secondary can only be zero if the primary is zero. This constraint is managed by the data builder.
The special CE is of the following form (where T = tag, d = data). The first nybble is F, to distinguish it from other cases. These are only used internal to the data table.
| F | F | F | F | T | T | T | T | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d | d |
| 4b | 4b Tag | 24 bit data | |||||||||||||||||||||||||||||
The tags have the values:
| NOT_FOUND_TAG | 0 |
|---|---|
| EXPANSION_TAG | 1 |
| CONTRACTION_TAG | 2 |
| THAI_TAG | 3 |
| CHARSET_TAG | 4 |
| SURROGATE_TAG | 5 |
| reserved | 6+ |
To test whether a ce is an extension CE, we use:
if (ce >= MIN_SPECIAL) ...
const int CONTINUATION_MASK = 0x80; const int CASE_MASK = 0x40; const int MIN_SPECIAL = 0xF0000000; const int MIN_VALUE = 0x02; const int UNMARKED = 0x03;
The following sample shows how this would work in practice. (ScriptOrder will be explained later).
// once we have a cc. Special CEs have already been handled.
continuation = (ce & 0x80) != 0;
ce ^= caseSwitch; // handle case bit, switching
caseBit = ce & CASE_MASK;
wt = ce & tertiaryMask;
ws = (ce >> 8) & 0xFF;
wp2 = (ce >> 16) & 0xFF; // second primary byte
wp1 = (ce >> 24); // first primary byte
if (scriptOrder != null && !continuation) {
wp1 = scriptOrder[wp1];
}
if (wp1 != 0) *primary++ = wp1;
if (wp2 != 0) *primary++ = wp2;
if (doSecondary) {
if (ws != 0) *secondary++ = ws;
if (doTertiary) {
if (wt != 0) *tertiary++
}
}
Note: in practice, the speed of collation is very dependent on the number of instructions in the inner loops. We will have a separate version of the loop that we will use for the common case: TERTIARY, no extra case level, IGNORABLE = ON. As it turns out, this is a very significant performance win. We may add other special-case loops as well.
If the FrenchSecondary flag is turned on, then the secondary values in the continuation CEs are reversed. This is so that when the secondary buffer itself is reversed (see below), the continuation values come out in the right order. This is done by the following pseudocode (where specialHandling is above)
if ((ce & FLAGS_MASK) == CONTINUATION_MASK) {
if (frenchStartPtr == null) frenchStartPtr = secondary - 1;
frenchEndPtr = secondary + 1;
} else if (frenchStartPtr != null) {
//reverse secondaries from frenchStartPtr up to frenchEndPtr
}
plus some code at the very end, after processing all CEs, to catch the final case.
if (frenchStartPtr != null) {
//reverse secondaries from frenchStartPtr up to frenchEndPtr
}
Note: In incrementally comparing strings, as opposed to sort keys, more work has to be done with French secondaries. Essentially, all the secondaries are buffered up, and if there is no primary difference they are compared in reverse order. See String Compare.
In the collation table, there is a value VARIABLE_MAX. All CEs with primary weights between zero and VARIABLE_MAX are considered to be variable. (This saves having a bit per CE!) If a CE is variable, we support the Shifted Option from the UCA in constructing. We process the UCA table to ensure that VARIABLE_MAX has a zero second byte for simplicity, but a tailored or parametric Variable_Max may have two bytes. (We do impose the restriction that it can't have more.)
The quarternary is computed based on the setting. With shifted, then it is skipped if the ce is entirely zero, equal to the primary if variable, and otherwise equal to FF. (In the UCA this is FFFF, but we can make it a single-byte weight.) In either case, the quarternary is compressed (see below).
If we are not doing a quarternary level, and the CE is variable, then it is simply ignored (treated as zero). That requires some amendment to the code above, since we have to make a check before we start adding to the primary.
The code for the quarternary itself looks something like the following:
if (doQuarternary) { // fourth level
if (continuation && lastWasVariable
|| (wp1 <= variableMax1 && wp1 > 0 // exclude common cases in first test
&& (wp1 < variableMax1 || wp1 == variableMax1 && wp2 <= variableMax2))) {
*quarternary++ = wp1;
if (wp2 != 0) *quarternary++ = wp2;
lastWasVariable = true;
} else {
// do normal weights, plus
if (doQuarternary) *quarternary++ = 0xFF;
lastWasVariable = false;
}
} else {
// do normal weights
}
Note: We have to remember if the last CE was a variable, just in case we had a continuation.
In the UCA and tailoring, we have computed a case bit. This bit is set ON based on character properties, not based on the tailoring or UCA ordering. In particular, the case bit is set ON for a string if and only if there is at least one character in the NKFD form of that string which either has a lowercase, or has a corresponding small kana form. For sample code for computing that in the UCA processing, see UCA Case Bit. We precompute certain values and store them in the collation object:
ce ^= caseSwitch; wt = ce & tertiaryMask;
The caseSwitch value is set to 0x80 iff UPPER_FIRST is set in the parameters (or it may come from the tailoring table, if DEFAULT is used). It is used after the continuation bits are discarded. It thus has the effect of changing Small to Large, Large to Small, and leaving Mixed alone.
Note: There is a restriction on the effect of UPPER_FIRST. If you have contractions such as ch, Ch, hC, CH, then both the Ch and cH will be treated as Mixed, and thus be unaffected by UPPER_FIRST.
The tertiaryMask value is normally set to 0x3F, to discard the case bit in tertiary values. It is set to 0x7F when the caseLevel is OFF, and we have either UPPER_FIRST or LOWER_FIRST.
If the caseLevel is ON, then we generate an intermediate level. This is targetted at the small/large difference for Japanese. Since we know that the case occupies exactly one bit, we optimize this (at the expense of some code) by storing it bit at a time (with one bit overhead per byte). That is, 7 or fewer letters in the string only take 1 byte; 14 or fewer 2 bytes, etc. The reason we have the extra bit is so that the separator value between levels is distinguished. The code will look something like:
// when creating the level keys
int caseShift = 0;
...
if (caseLevel) {
if (caseShift == 0) {
*case++ = 0x80;
caseShift = 7;
}
case[-1] |= (wt & 0x80) >> caseShift--;
}
We will use the technique discussed in UCA to reduce the length of sort keys that contain a series of common weights in non-primary positions. This produces very significant reductions in sort key size, since most secondary, tertiary, and quarternary weights are UNMARKED. (Primary weights are also compressed: see Appendix 2 for a description of how that is done.)
We make sure there are no real weights that are both greater than COMMON (abbreviated by C below) and less than or equal to COMMON_TOP (abbr. T). Then look at a series of COMMON bytes followed by bytes H (higher than C) or L (lower than C), or nothing. Here is the ordering, and the compression that also produces that ordering. For illustration, the gap between T and has been artificially set to a small number to illustrate what happens if the normal compression range is exceeded.
| Original Bytes | Result Bytes | ||||||
|---|---|---|---|---|---|---|---|
| C | H | T | H | ||||
| C | C | H | T-1 | H | |||
| C | C | C | H | T-2 | H | ||
| C | C | C | C | H | T-3 | T | H |
| C | C | C | C | L | C+3 | C | L |
| C | C | C | C | C+3 | C | ||
| C | C | C | L | C+2 | L | ||
| C | C | C | L | C+2 | L | ||
| C | C | C | C+2 | ||||
| C | C | L | C+1 | L | |||
| C | C | C+1 | |||||
| C | L | C | L | ||||
| C | C | ||||||
To do this in code, we replace a statement like:
*secondary++ = ws;
we write:
if (ws == COMMON2) {
++count2;
} else {
if (count2 > 0) {
writeCompressed2();
count2 = 0;
}
*secondary++ = ws;
}
void writeCompressed2() {
if (ws > COMMON2) { // not necessary for 4th level.
while (count2 >= MAX_TOP2) {
*secondary++ = COMMON_TOP2 - TOP_COUNT2;
count2 -= TOP_COUNT2;
}
*secondary++ = COMMON_TOP2 - count2;
} else {
while (count2 >= BOT_COUNT2) {
*secondary++ = COMMON_BOT2 + BOT_COUNT2;
count2 -= BOT_COUNT2;
}
*secondary++ = COMMON_BOT2 + count2;
}
}
Note: if count2 > 0, then writeCompressed also needs to be called at the very end of the sortkey generation. Similar code is used for tertiaries and quarternaries, with different values for the constants. MAX_TOP is derived from the other values:
TOTAL_COUNT = COMMON_TOP - COMMON_BOT - 1
TOP_COUNT = TOP_FACTORn * TOTAL_COUNT
BOT_COUNT = TOTAL_COUNT - TOP_COUNT
The choice of how much space to put in MAX_TOP vs MAX_BOT depends on the relative frequency of H vs L bytes following. Remember that "nothing" counts as an L byte. For all of the values, see Magic Bytes.
Secondaries
In the processing of the secondaries for the Fractional UCA, we will allocate a gap of 0x80
Tertiaries
If UPPER_FIRST or LOWER_FIRST is used, we will take the 8 bits from tertiaries (including the case/continuation bits) from a CE, The top 0x40 values are continuation bits. Those are discarded, and a gap of 0x40 is used.
Otherwise, 6 bits are taken from the tertiaries (excluding the case/continuation bits), and we allocate a gap of 0x80.
Quarternaries:
Since there are never any higher bytes than FF, TOP_COUNT in that case is zero, and the code could be slightly simpler.
TOP_COUNT is computed from VariableTop. Take the first byte of VariableTop, and add 1 to it.
We can use stack buffers for almost all cases. The vast majority of sorted strings are fairly small, so we optimize for that. Here's basically how this is done..
// allocate buffers #define BUFFER_SIZE 1000 char primaryBuffer[BUFFER_SIZE]; char secondaryBuffer[BUFFER_SIZE]; char tertiaryBuffer[BUFFER_SIZE]; char caseBuffer[BUFFER_SIZE]; char quarternaryBuffer[BUFFER_SIZE]; // initialize buffers, normally to stack int max; boolean allocatedPrimary = false, allocatedSecondary = false... char* primary = *primaryStart = outputBuffer; // write into output buffer, if large enough max = getMaxPrimaryFactor() * sourceLength; if (max > BUFFER_SIZE) primary = primaryStart = malloc(max); else if (max > outputLength) primary = primaryStart = primaryBuffer; char* secondary = secondaryBuffer; max = getMaxSecondaryFactor() * sourceLength; if (max > BUFFER_SIZE) secondary = secondaryStart = malloc(max);
// tertiary, case, quarternary like secondary.
...// do code
// clean up after copying contents to output
if (primaryStart != outputBuffer && primaryStart != primaryBuffer) delete(primaryStart); if (secondaryStart != secondaryBuffer) delete(secondaryStart); if (tertiaryStart != tertiaryBuffer) delete(tertiaryStart); if (caseStart != caseBuffer) delete(caseStart); if (quarternaryStart != quarternaryBuffer) delete(quarternaryStart);
By handling it this way, we don't need to do any error checking in the loop for buffers being too small.
Once we have the level keys, we put them into the sort key. We keep separate buffers for each one. The primary buffer is initially the result buffer for the function; that saves a later copy. In the normal case, we fill each buffer, then append the secondary, tertiary, etc. to the result buffer. We normally use stack buffers for all of these; only if the result would be larger do we allocate a temporary buffer (divided into pieces internally) that we use for our results. This temporary buffer is deallocated at the end. However, in normal operation this should not occur; it is only for graceful degradation.
We support preflighting; if the result overflows the result buffer, then we set the appropriate error, but also return the length it would have had. This is done by switching to an alternate routine that just counts bytes (and adds them to the bytes we already had.
We use buffers large enough that we can avoid making multiple tests on buffer sizes per character.
Note: LEVEL_SEPERATOR is 01, not 00 (as in the previous version of ICU). All sort weights are constructed to avoid the 01 bytes so that 01 can be used as the separator. This allows the resulting sort key to be a C string (no null bytes except for a terminating null byte). This means that the sort keys can be compared with strcmp (on platforms where strcmp uses, or behaves as if it uses, unsigned bytes).
For IDENTICAL strength, we append the (normalized) string to the end of the sort key. The string is processed using a variant of BOCU (called BOSCU).
Note: The IDENTICAL strength is not recommended for general use. Some people believe that it makes the sort stable: that is a misapprehension: a stable sort is one where equal records come out in the same order as they were put in. This requires more than simply distinguishing strings that are the same for the primary, secondary and tertiary weights. A common solution is to append the record number to the sort key.
String compare uses an incremental algorithm. The goal is to minimize the work if possible, instead of building two sort keys and doing a binary compare. The latter, however, is an escape valve; for rare cases that would be complicated to support, we fall back to that.
In the old ICU we had a rather complex algorithm to do the comparison. The problem comes in with dealing with ignorables — at each level. We now simplify the algorthm. Basically, we do the following;
compare initial characters
loop through the characters in both strings until a binary difference is found
backup if we have to (see below)
loop comparing the primaries.
fetch CEs for each string successively
if either is ignoreable, stuff its CE into a buffer and fetch another.
repeat until neither is ignorable.
if the primaries are different, return the ordering
otherwise, check for END_CE (returned by FetchCE at end of string), and break out of this loop
otherwise stuff the CEs into two buffers for later, and loop to #2
loop through the secondaries in the buffers.
For French, we go in reverse order. Continuations need special processing to get the order right.
Skip any ignorable secondaries; otherwise return if there are any differences.
loop through the case bits in the buffers.
Skip any ignorable tertiary; otherwise return if there are any differences.
loop through the tertiaries in the buffers.
Skip any ignorable tertiary; otherwise return if there are any differences.
loop through the quarternaries.
compare the binary, normalized string (see Identical Strength)
In the above process, we skip any step unless the parameters call for it. So the normal case will do #1, #2, #3, #5, stopping if there is any difference. If we ever have a buffer overflow, we just bail to comparing sortkeys. (This will be extremely rare in practice). We put special terminating values in the buffers so that we can easily recognize the end.
If normalization is ON, then we will first check with checkFCD to see if we need to normalize; only if we really need to will we call the normalization process.
It pays to check for identical prefixes with a binary compare, just because it is so simple to do, and is a fairly common case. Once we find a common initial substring, we may have to backup in some circumstances to a "safe" position. Here are examples of unsafe characters, in the third column:
| Type | Example | |
|---|---|---|
| Contraction (Spanish) | c | h |
| Normalization | a | ° |
| Surrogate | <S> | |
Contractions. Suppose we have "ch" as a contraction, so that "charo" > "czar". Then when we hit the difference at "h" and "z" we can't just compare the rest of the strings "haro" and "zar": we would get the wrong answer. We have to backup by one. We have to repeat the test, because the position we backup to may also be in the middle of a contraction. So the way we do this is to flag every character that is a trailing character in some contraction. (A trailing character is any but the first).
Normalization and Canonical Order. To get to a safe position, we backup if the character after the pointer is a MAYBE in the NFC QuickCheck table, or has non-zero combining class.
Surrogates. Since supplementary characters are exceedingly rare, to simplify our processing we will simply backup if the code unit after the pointer is either a Lead or Trail surrogate: anything in D800..DFFF. To avoid clogging the unsafe table, this will just use a range check, and no supplementary characters will be in the unsafe table.
Because we will not normally need to backup and the whole purpose of the initial check is for performance, we don't waste time trying to find out exactly the circumstances where we need to backup. So we can err on the side of backing up more than we theoretically have to, to reduce the computation. We just keep a simple bit table in the tailoring table. This table contains a set of flags that mark a code unit (UTF-16) is flagged as "unsafe".
To make the table smaller, we actually just hash codes into the table. This may cause us to back up slightly more often, but that will not be a correctness issue. We do this if the character is marked as "unsafe" either in the UCA or in the tailoring. (We might later copy the contents of the UCA into the tailoring table.) The first 256 bits are reserved for Latin-1, to avoid collisions.
A collation table, whether UCA or a tailored table, contains the following subtables. It is in a flattened form that can be loaded as a DLL or into read-only shared memory. The header contains offsets to all the other subtables, plus the number of items in each subtable, plus various information that is used in processing, such as the maximumPrimaryFactor. A CE is a uint32_t. A UChar is 16 bits.
Note: we make the processing faster by having offsets everywhere in the table be from the very start of the whole table, not from the start of the each subtable!
The order of the subtables is not determined, with one exception: because the offsets to the Expansion table have only 20 bits, we put that one first. The position of each table is determined by looking up the offset in the header.
|
|
|
|
|
|
Each of these tables is described in more detail in the section of Fetching CEs that deals with them.
getCE is a function that returns a single CE, based on a source buffer. It is used by the sort-key generator, for the incremental string comparison, and by the public iteration API. There is a parallel version that fetches CEs going backwards through a string. That version is used in the fast Boyer-Moore international search algorithm.
Because two strings are "live" at the same time in comparison, we will pass in a parameter block (allocated on the stack) with state
information for each string to getCE. This is called a Context:
| source | The source character position |
|---|---|
| sourceEnd | To know when to end |
| ceBuffer | For CEs that are results of expansion |
| ceBufferStart | The start index for CEs in the buffer. |
| ceBufferEnd | The limit index for CEs in the buffer |
| isThai | Have we encountered Thai pre-vowel? |
For each string, we keep a ceBuffer for expansions. This is a FIFO queue, allocated on the stack. It is large enough to handle all reasonable expansions (e.g. up to 100). We will not build longer expansions in the tables so we never need to check for overflows.
There are two pointers: ceBufferStart and ceBufferEnd that point to the contents. The function is demonstrated below.
For compliance with UCA, we have to get the same results as if the text is in NFD. However, in the majority of cases, we manage this without the performance cost (or extra buffer management) of actually doing the NFD conversion.
In the normal case, we fetch characters directly from the source buffer. This is the case either if normalization is OFF, or if we had passed the CheckFCD test. (Clients of the API should only turn normalization off if all the strings are guaranteed to be in FCD.)
This testing is done at the very beginning of the routine, not within any loops. If we have to normalize, then we do so with NFD into a stack buffer (if possible). If too big to normalize into the stack buffer, we allocate a temporary buffer. This allocation should be rare, and will thus not be a performance issue. In either case, we reset the source pointer to point at the normalized text, so we do no extra work within the loops.
// initialize
uchar* sourceBuffer [BUFFER_SIZE];
uchar* source, *sourceStart;
source = sourceStart = inputBuffer;
if (normalization_on && !checkFCD(inputBuffer, inputLength) {
// normalize into sourceBuffer if possible, resetting source, sourceStart
// if too big, allocate memory, resetting source, sourceStart
}
uchar* sourceEnd = source + sourceLength;
....
// cleanup
if (sourceStart != inputBuffer && sourceStart != sourceBuffer) delete(sourceStart);
So what is checkFCD? We will start with a couple of definitions.
Define the raw canonical decomposition (RCD) of a string to be the result of replacing each character by its canonical decomposition, without canonical reordering.
Define a string to be in fast C or D (FCD) if its raw canonical decomposition is of form NFD.
Examples:
| X | FCD | NFC | NFD | Comments on FCD |
|---|---|---|---|---|
| A- ring | Y | Y | ||
| Angstrom | Y | RCD(X) == A + ring == NFD(X) | ||
| A + ring | Y | Y | ||
| A + grave | Y | Y | ||
| A-ring + grave | Y | |||
| A + cedilla + ring | Y | Y | X == RCD(X) == NFD(X) | |
| A + ring + cedilla | X == RCD(X) != NFD(X) == A + ring + cedilla | |||
| A-ring + cedilla | Y | RCD(X) == A + ring + cedilla != NFD(X) |
Note that all NFD strings are in FCD, and in practice most NFC strings will also be in FCD; for that matter most strings (of whatever ilk) will be in FCD.
We guarantee that if any input strings are in FCD, that we will get the right results in collation without having to normalize. We can do this because we do a canonical closure both in our Fractional UCA table and in our Tailoring table, and we handle Hangul decomposition algorithmically in Hangul Implicit CEs. So any composite automatically expands as the correct series of CEs. If the string is FCD, then this expansion will be in the right order and everything is hunky-dory, even without normalization to NFD.
Luckily a test for FCD is even faster and more precise than Normalization QuickCheck. We have a static FCD data table that is precomputed to map each Unicode code point X to the pair of combining classes of the first and last characters in the canonical decomposition for X. This table is constructed as a standard UTrie. Then the code is something like:
boolean checkFCD(uchar *p, uchar* endp) {
uint16_t fcd;
uint8_t prevCC;
uint8_t cc;
prevCC = 0;
while (p < endp) {
fcd = FCD(*p++);
cc = getLeadCC(fcd);
if (cc != 0 && prevCC > cc) return false;
prevCC = getTrailCC(fcd);
}
return true;
}
When we are dealing with null-terminated strings, we have to make a pass through the characters anyway to find the length so that we can set up the buffers properly. These two checks are combined into a single routine for performance.
Note: The Unicode 2.1.9 UCA tables do not contain canonical closures. Formally speaking, NFD must be used for compliance to UCA, but the table also contains specially constructed CEs for characters that have canonical decompositions. These special CEs are useful in certain environments where the input format of strings is guaranteed to be constrained, and also supply a certain degree of compression. However, they handle a narrower range of strings without normalization; the compression is smaller than what we get with other techniques; and most importantly, strings produced with normalization OFF are not comparable to strings produced with it ON (unlike with canonical closures). The ISO 14651 tables do contain the canonical closures, as will the UCA tables in the future.
If the CE is of the form Ftyyyyyy, then it has a special interpretation. For specials, t is used as a switch, and yyyyyy is an offset. By choosing this value, and making this range adjacent to the NOT_FOUND marker, we save on switches. The following is a pseudocode sample of how this would work:
int getCE(...) {
// get it out of buffer, if available
if (ceBufferStart < ceBufferEnd) {
ce = *ceBufferStart;
if (ceBufferStart == ceBufferEnd) {// reset!
ceBufferStart = ceBufferEnd = ceBuffer;
}
// return if done
if (source >= sourceEnd) return EOS;
// get character, and do simple mapping
ch = *source++;
if (ch < 0xFF) {
ce = tailoredData[ch]; // Latin1 is always here!
} else {
ce = tailoredData[tailoredIndex[(ch >>> 8)] + (ch & 0xFF)]; // trie
}
if (ce >= NOT_FOUND) { // NOT_FOUND or SPECIAL
if (ce > NOT_FOUND) { // handle special casing
getSpecialCE(tailoredSpecials, ...);
}
// if still not found, then try in the main table
if (ce == NOT_FOUND) {
ce = UcaData[UcaIndex[(ch >>> 8)] + (ch & 0xFF)]; // trie
if (ce > NOT_FOUND) {
getSpecialCE(UcaSpecials, ...);
}
if (ce == NOT_FOUND) {
// make artificial CE from codepoint, as in UCA
}
}
}
return ce;
}
const int NOT_FOUND = F0000000;
Note: NOT_FOUND is higher than all non-SPECIAL CEs, and less than all non-specials.
Note: every tailoring table is built to have all Latin1 characters, even when they are identical with the UCA table. That way the Latin1 case is as fast as possible.
In the case that we do have specials, it falls into certain cases: Contraction, Expansion, Thai, Charset, and Surrogate. For processing these, we would do something like the following pseudocode:
while (true)
// special ce is has these fields:
// first nybble (4 bits) is F, next nybble (4 bits) is type
int type = (ce >> 24) & 0xF;
// next 24 bits are data
int data = ce & 0x00FFFFFF; // remove F, type
switch (type) {
case NOT_FOUND_TAG: break;// never happens
case THAI_TAG: // do Thai, Lao rearrangement
...
case CONTRACTION_TAG: // do contraction thing
...
case EXPANSION_TAG: // do expansion thing
// put extra CEs into ceBuffer
...
case SURROGATE_TAG: // post 1.8
//use offset, ch and *source to for trie with dataTable.extendedData
...
case CHARSET_TAG: // (post 1.8)
// do
ce = (ce << 8) | 0x0303; // charsets only used for primary differences, so use middle 16 bits
// the 0303 is to make a well-formed CE.
charConverter[charSetNum].getOrdering(ch, ceBuffer, ceBufferTop);
break;
}
if (ce <= NOT_FOUND) break; // normal return
}
The expansion table is simply a list of CEs. Internally it is broken into sections. The longer ones are null terminated: the others use an external length, based on the data from above.
|
Expansions |
|---|
|
CE |
|
CE |
|
CE |
|
.... |
The data is broken into two pieces: 4 bits for length, 20 bits for offset. A length value of 0 means that the actual length didn't fit in 4 bits, and the expansions are instead terminated by 00000000. Otherwise, the length is used to determine the number of CEs to add to the ceBuffer. E.g.
len = ce & 0xF;
offset = ce >> 8;
if (len == 0) // go until terminated
ce = expansionTable[offset++]; // get first one. Never 0
loop {
item = ExpansionTable[offset++];
if (item == 0) break;
ceBuffer[ceBufferTop++] = item;
}
} else {
ce = expansionTable[offset++]; // get first one.
for (int i = len-2; i > 0; --i) {
ceBuffer[ceBufferEnd++] = ExpansionTable[offset++];
}
}
|
Important: when processing backwards (e.g. for French secondaries), expansion CEs have to be fed out
backwards as well. This is where the continuations are important: unlike the others they are not reversed. That is, if an expansion consists
of |
The contraction tables consist of two parts, one 16 bits wide (uchars) and the other 32 bits wide (CEs). It uses two separate arrays instead of an array of structs to avoid alignment padding (this is also a far smaller footprint in the Java version!!). The first uchar in each section is actually a delta offset, not a uchar. It is to be added to the current position in the table to get to the real offset.
|
|
||||||||||||||||||||||||||||||||||||||||||
From the original CE, we use the data as an offset into the Contraction UChars table. If backwards is on (a programmatic setting for searching), we add the backwards offset delta to get a different backwards table, otherwise we advance one. We grab a character from the source. We search the characters (which are in sorted order). If a target char >= source char, return the defaultCE (which may be expansion). If target char == source char, get the corresponding result. If that result is a contraction, grab another character, extract the offset, jump to the new section of the table and keep looping. Otherwise return that result (may be expansion).
The first line of each of the subtables has a special meaning. The contraction uchars values split into two 8 bit values, and used for discontiguous contractions.
max-cc: the maximum canonical combining class found in the table (8 bits).
all-same: true (FF) if all of the non-zero canonical combining classes are identical.
defaultCE: two different values
if the table is pointed to directly from the main data table, it is the CE we would have gotten if we had not had a contraction.
otherwise, it is NOT_FOUND.
We have to be careful of one special case. Suppose that JKL and KM are both contractions. When processing JKM, when we fail to find JKL, we need to be able to back up so that we can correctly return CE(J), CE(KM). When we hit NOT_FOUND, we unwind back to the first character, return the defaultCE for that case, and continue after it.
Sample pseudo code:
// only do backwards check first time. Cast to signed int delta if we are.
if (backwardsSearch) offset += contractionUChars[(int16_t)offset]; else ++offset;
// loop over multiple contractions
while (true) {
if (source >= sourceEnd) {
contractionUChars[--offset]; // return default if end of string
break;
}
uchar schar = source++;
int startPosition = offset;
uchar tchar;
while (schar > tchar = contractionUChars[offset++]); // loop til found or done
if (schar != tchar) offset = startPosition - 1; // use default if not found
ce = contractionResult[offset];
if (ce < LOWEST_CONTRACTION) break;
offset = ce & 0x00FFFFFF; // get new offset and keep looping
}
// we've broken out of the loop
if (ce < LOWEST_EXPANSION) return ce;
else // do expansion thing
We know the inner loop terminates, since we always end each list of chars with FFFF. If the user happens to use a malformed string containing FFFF, we are still safe, since we store defaultCE in the corresponding result position.
There is a further complication for contraction. Suppose that a + ring is a contraction, sorted after z. As specified in UCA, adding additional marks, such as underdot, should not cause the a+ring to revert to sorting after 'a'. While in practice this does not occur often, it does happen in some languages. That means that just as in NFC, we have to handle cases of contraction across other combining marks. However, since this will be very rare, we do not want to degrade the normal performance because of it. Here are some sample cases for comparison, all supposing that a + ring is a contraction:
a + b
a + ring + b
a + underdot + b
a + grave + b
a + underdot + ring + b
a + ring + grave + b
Case 1 and 2 are the ones to particularly focus on for performance. Here is how we do it. Since the input is normalized (either because we normalized to NFD, or because the user knows that the text is already FCD (see CheckFCD), we know that any combining marks are in canonical order. We only have to worry about the ones that have non-zero combiningClass.
If all of the following conditions are true for a character X (which has canonical combining class CCX), then we call a special routine.
max-cc is non-zero (indicating that there is a combining mark in the table that we might have to skip)
CCX != 0 (since otherwise there can be no intervening combining marks)
CCX > max-cc (since if it were smaller, it would not be in normalized order -- if it were the same, it would be blocked).
Note: The special routine will be called so seldom that it does not have to be highly optimized!
It will do additional checks by looking ahead character-by-character to see if one matches the contraction table. It stops with a failure if any of the subsequent characters have a canonical combining Class of zero. If there is a match in the table, it checks to make sure that the previous character does not block the match (that happens if the previous character has the same combining class as the match). If the whole contraction matches, then special handling is invoked, since we have to (logically!) remove the intervening characters that were found! That is, suppose we have pqrstu, where prt is a discontiguous contraction. Then the result should be CE(prt), CE(q), CE(s), CE(t). That means that after prt is processed, we then have to process the characters that come between.
Note: UCA only skips combining marks between elements of a contraction if they are trailing. It will not, for example, match a + b against a + ring + b. One cannot simply match the a + b and act like the ring follows, since that would not distinguish it from a + b + ring. In those instances, one would need an explicit contraction in the table for a + ring + b.
Certain Thai, Lao character rearrange (see UCA). In UCA if x is a Thai vowel, "xyz" should behave as if it were "yxz". To avoid overhead of testing for character classes, we give all the rearranging characters a Special class. At the very beginning, we turn Thai processing ON.
If Thai processing is ON, and we hit a Thai vowel, we backup by one source character, and copy the source buffer (if it is not our own private buffer) to a writable buffer. We then pass through all the remaining characters, and rearrange any Thai ones. We turn Thai processing OFF for the rest of the string, and return a zero CE (ignorable).
When Thai processing is OFF, we use the data as an offset into the Expansion table. We fetch exactly 1 element, and process it (looking for specials, so it can be an expansion or contraction).
Surrogates can be handled already using contractions, but this allows us the freedom to add an extra table for Unicode 3.1, when someone might want to add tens of thousands of surrogates. For such a case, we will have an optimized table. Essentially, what we do is add the following tables.
|
|
Fetch the next source character. If it is not a surrogate, backup, return a 0 CE (completely ignorable).
Otherwise get the bottom 10 bits of that next source character. Perform the normal trie operations: take the top few bits and add them to data. Use that to lookup in the SurrogateTrieIndex, and find an offset. Add the bottom few bits to that, and use that to index into the SurrogateTrieData to get the CE.
If that CE is an expansion or contraction, handle those cases, otherwise just return.
If a character is not explicitly in the UCA table, then it is assigned an implicit CE which is generated from the code point. Implicit CEs are used for all characters that are not explicitly in the UCA table. Because of the way UCA is defined, these have to be in two groups, with different lead bytes. CJK Ideograph implicits are assigned with lead bytes from E8 to EB; other (including unassigned characters and Yi) are assigned with lead bytes from EC to EF.
Note: Hangul implicit CEs are generated specially: see Hangul Implicit CEs!
Note: As per UCA 7.1.1 lllegal code points, all values xxFFFE and xxFFFF are ignored, as are all unpaired surrogates. The code to do this (after fetching the complete code point) is:
if ((codePoint & 0xFFFE) == 0xFFFE || (0xD800 <= codePoint && codePoint <= 0xDC00)) { return 0; // illegal code value, use completely ignoreable! }
In these generated CEs there is always a gap at least one (for possible later insertion), and with no zero bytes in the primaries. The implicit CEs use an optimized form that uses primary weights from Dxxx to EFFF. Basic code points (0000..FFFF) get a 24-bit primary weight, while supplementary code points (10000..10FFFF) get a 32 bit primary weight. Since the latter will be rare, this does not represent a performance issue.
Note: These values are generated on the fly, not stored in the tables. They are only generated if there is no explicit table entry for the code point!
Distribute the bits as follows. The resulting primary uses 3 bytes in sort keys, and has a secondary and tertiary of 03 (UNMARKED).
CP = xxxxyyyy yyyzzzzz
CE1 = 1101xxxx 1yyyyyyy 00000011 00000011 // normal
CE2 = zzzzz100 00000000 00000011 10000011 // continuation
The primary gap is larger than one, which allows more elements to be inserted (in tailoring) without using extension CEs. Suppose that we have a Basic CP with qqqqq is zzzzz + 1; then here are the possible insertion values, marked with *. (The values at ** can be used iff qqqqq != 0.)
zzzzz100 zzzzz101 * zzzzz110 * zzzzz111 * qqqqq000 ** qqqqq001 ** qqqqq010 * qqqqq011 * qqqqq100Note: Hangul Syllables are handled specially. If normalization is on, they decompose to Jamo and are treated normally. If normalization is off, then they fall through to the Implicit CE generation. The implicit CEs are generated as above, but then they are shifted to be in the Jamo range. This provides for compatibility between normalized and unnormalized text.
First subtract 10000 from CP, then distribute the 20 remaining bits as follows. The tertiary is UNMARKED. The resulting primary uses 3 bytes in sort keys.
CP = wwww xxxxxxxy yyyyyyzz
CE1 = 1110wwww xxxxxxx1 00000011 00000011 // normal
CE2 = 1yyyyyyy zz100000 00000000 10000000 // continuation
CE1 = 0xE0010303 | (CP & 0xFFE00) << 8;
CE2 = 0x80200080 | (CP & 0x001FF) << 22;
There's a large gap for customizing.
Except for CJK and Hangul, this results in sort keys that are 1 byte shorter per basic code point than what is described in UCA. The basic CJK and Hangul code points do take 1 byte longer per code point in sort keys than in UCA, but (a) UCA does not allow for tailoring relative to implicit code points without moving the code points, and (b) all of the CJK countries typically have explicit mappings for the characters they care about, which will reset them down to 2 bytes in those cases.
If there is a specific position set for [undefined] (see rule syntax), one that is not at the end of the file, then the weights are computed differently. The length of bytes required depends on the size of the gap where the Undefined items are positioned. For example, suppose there is a gap of only 1 where [undefined] is inserted, so that elements all start with a 16 bit primary pppppppp pppppppp. Here is how they would be generated:
With a gap of one, the resulting primary occupies 5 bytes in sort keys:
codepoint = wwyyyyyy yzzzzzzzCE1 = pppppppp pppppppp UNMARKED UNMARKEDCE2 = 111100ww 1yyyyyyy 1zzzzzzz 11110000
With a gap of one, the resulting primary occupies 6 bytes in sort keys. (again, subtracting 10000)
codepoint = vvvw wwwwwwyy yyyyyzzzCE1 = pppppppp pppppppp UNMARKED UNMARKEDCE2 = 11111vvv 1wwwwwww 1yyyyyyy 11110000CE3 = 1111zzz0 00000000 00000000 00000000
However, if the gap is at least 8 at the Undefined position, then effectively 3 bits from the first primary can be stolen, and all values would take 5 bytes in the sort key.
Hangul Implicit CEs are handled specially. For everything but Hangul, our UCA table and Tailoring table generation guarantees that if you have FCD text, you get the same results if you turn off decomposition in the algorithm. (See CheckFCD.) However, Hangul syllables still need to be decomposed. So if we do not normalize the text, then the Hangul Syllables will fall through to the Implicit CE phase. At that point, we will do a quick expansion, as in the following pseudocode.
Note: The JamoSpecial flag in the following code comes from the tailoring table. See Hangul Building for more information on how this is built in the UCA table and in the tailoring tables.
const int
SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
LCount = 19, VCount = 21, TCount = 28,
NCount = VCount * TCount, // 588
SCount = LCount * NCount, // 11172
LLimit = LBase + LCount, // 1113
VLimit = VBase + VCount, // 1176
TLimit = TBase + TCount, // 11C3
SLimit = SBase + SCount; // D7A4
// once we have failed to find a match for codepoint cp, and are in the implicit code.
unsigned int L = cp - SCount;
if (cp < SLimit) { // since it is unsigned, catchs zero case too
// divide into pieces
int T = L % TCount; // we do it in this order since some compilers can do % and / in one operation
L /= TCount;
int V = L % VCount;
L /= VCount;
// offset them
L += LBase;
V += VBase;
T += TBase;
// return the first CE, but first put the rest into the expansion buffer
if (!context->JamoSpecial) { // FAST PATH
pushOnExpansionBuffer(UCAMainTrie(V));
if (T != TBase) {
pushOnExpansionBuffer(UCAMainTrie(T));
}
return UCAMainTrie(L); // return first one
} else { // Jamo is Special
// do recursive processing of L, V, and T with fetchCE (but T only if not equal to TBase!!)
bufferEnd = tempBuffer;
*bufferEnd++ = L;
*bufferEnd++ = V;
if (T != TBase) {
*bufferEnd++ = T;
}
return fetchCE(... tempBuffer, p, ...); // loops, stuffing remaining CEs into expansion buffer.
}
}
Note that because of this special processing of Hangul Syllables, we do not allow contractions between Hangul Syllables and other characters. That is, you can't have the contraction like:
& z < 각x
To save space, we can use Charset Ordering. This is to account for the case where CJK characters are essentially just sorted in character set order, e.g. by JIS order. To do this, we would add functions to character set converters, as described in the API section.
ScriptOrder uses an optional array to reorder the top bytes of primary keys. A valid ScriptOrder array must map 00 to 00, and Fx to Fx, and variable primaries to variable primaries. Other bytes it is free to rearrange, but the result must be a permutation. This works by making sure that scripts do not share a first primary byte (see #UCA Processing).
Note: Script Ordering is not applied in continuations.
The flat file structure is very similar to what is describe in UCA, with extensions based upon the discussion here. Using a flat file allows us to dramatically decrease initialization time, and reduce memory consumption. When we process tailoring rules, we have an internal format that is very much like the UCA data. We will have a function that writes out that format into a flat file. Note that this is not an area where we will spend much time on performance, since 99% of the time this is done at ICU build time.
When we build a tailoring, we make the following modifications to the current code. The current code builds an ordered list of tokens in the tailoring, where each token is an object containing the character(s), an indication of the strength difference, plus a special field for contracting characters. Once it is done, it assigns CEs to those characters based on the ordering and strength, putting the CE's into a trie table plus data structures for expanding, contracting, etc. Instead of this, we will build multiple lists, where each list is anchored to a position in the UCA.
We will also allow "& X < Y", where X is not explicitly defined. (In ICU 1.6 and previous, this is disallowed). In that case, Y will give a value that is based off of the implicit value that X would have. In processing the rules in the implementation, we may give X the implicit value in the intermediate data structure, then remove it when we finally store into the flat file to save space.
Since we are tailoring the UCA, we could have call to insert elements before a given UCA element. Currently, that would have to be done by finding the previous element in the table, and inserting after. That is, to insert x with a primary difference before 'a', we have to know that '9'' is before it in the UCA, and say &'9' < x. However, this isn't very robust, so we add extra syntax for it. See Rule Syntax.
Once the tailoring table is completely built, we will add any UCA characters in the range 0..FF that are not there already. At a slight increase in table size, that guarantees the minimal code path for many cases. We will also close the file under canonical decomposition, so that turning off decomposition can be done for performance (in some cases). We will close the file under compatibility decomposition, so that when you tailor a letter, the compatibility characters containing it are also moved. E.g. if we put & z < i, then the "fi" ligature will then sort after "fz".
We set JamoTailored in the UCA table structure to true if any of the characters (U+1100..U+1112, U+1161..U+1175, U+11A8..U+11C2) are contractions or expansions. JamoSpecial in the tailoring tables is set to true iff JamoSpecial is set in UCA OR any of those characters are tailored in the tailoring rules. 99% of the time, the Jamo will not be special, and we can just get the values directly from the UCA Main Trie table. Only if the Jamo are special do we have to use the more expensive lookup.
ICU 1.6 and below used direct insertion for tailoring, as shown below. ICU 1.8 uses postpone insertion, as shown below. With postpone insertion, rules
| Source | Result | |
|---|---|---|
| UCA | ... x <<< X << x' <<< X' < y ... | |
| Direct Insertion | & x < z | ... x < z <<< X << x' <<< X' < y |
| Postpone Insertion | & x < z | ... x <<< X << x' <<< X' < z < y |
The reason we didn't originally use postpone is that it is relatively easy to emulate. For example, in the above case, & X' < z with direct insertion is the equivalent of & x < z with postpone insertion. However, it is not easy to emulate direct insertion with postpone insertion: in the above case you would have to use & x < z <<< X << x' <<< X' to emulate &x < z. However, postpone insertion is probably more intuitive, is more stable under new versions of UCA, and works better with the large number of variant characters in UCA. In practice, given the tailoring rules from the old ICU, this will actually produce results that more compatible than if we retained direct insertion.
The API for get rules will return just the tailoring rules (we store this with the flat file). The API for get full rules will get the UCA rules (we generate and store this with the UCA flat file), then appends the tailoring rules.
The following describes the CE generation process in more detail. In all of the following discussion:
For now, the discussion will only cover the positive directions, since we are only doing that in 1.8.
We will generate the following internal structures. (The CEtoStr is generated from the UCA, and stored in a flat file).
|
|
|
| |||||||||||||||||||||||||||||||||||||||||
| list of list of tokens | header for token list | tokens, mapping to chars |
mapping to: next, previous CEs, characters between CEs | |||||||||||||||||||||||||||||||||||||||||
Notes:
Each FCE is stored as two Unflagged CEs, a and b. If there is no need for a continuation, CEb is zero. We add two terminating elements, the lowest possible CE and greatest possible CE at the start and end of the table to ensure we have bracketing values. To find next and previous, we take baseCE, and binary search for it in this table. There is a fast binary header, supporting completely unrolled binary search.
StrCRep is either a codepoint, or if the top byte is non-zero, is a StrRep pointing into StringContinue. If one CE corresponds to two character strings, both are listed in StringContinue, separated by FFFF. If a string has an expansion, that uses FFFE as a delimiter. Example:
CE_Reverse
At the end of this process, we will end up with a list of lists of tokens.
We will then close the sets of tokens with the canonical closure and Latin-1 characters,
To do the next phase, we need the following CEs from the Base Collation Element:
[13 . 06 . 09] <= BCE
...
[13 . 06 . 09] <= nextCELow3
[13 . 06 . 0B] <= nextCEHigh3
...
[13 . 06 . 1F] <= nextCELow2
[13 . 07 . 03] <= nextCEHigh2
...
[13 . 0B . 03] <= nextCELow1
[15 . 03 . 04] <= nextCEHigh1
These are determined as follows from the CEtoStr table.
From BCE, find the first subsequent CE that is tertiary-greater. That becomes nextCEHigh3. nextCELow3 is simply the CE immediately before nextCEHigh3.
...
From BCE, find the first subsequent CE that is primary-greater. That becomes nextCEHigh1. nextCELow1 is simply the CE immediately before nextCEHigh1.
From BCE, find the first previous CE that is tertiary-less. That becomes previousCELow1. previousCEHigh1 is simply the CE immediately after previousCELow1.
...
Note that ranges may be identical. Thus nextCEHigh1 might be the same as nextCEHigh2. Since nextCELow is determined by nextCEHigh, if the High's are the same than the Low's are the same.
If the ranges overlap, then all the items are inserted at the higher level. Thus if nextCEHigh2 == nextCEHigh3, then we insert "<<< x <<< y << z <<< w << u" between nextCELow2 and nextCEHigh2. There are still only 2 key values.
How do we find the number of items to squeeze in at each level? We fill in the CountN fields in each token, working backwards. Any time we go up a to a higher (i.e., lower numbered) level, we reset the lower counters to 1. Here is an example of a series of tokens, and their levels and counters after we have done this process. (We omit the identical cases to make the diagram simpler).
| L1 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L2 | 4 | 3 | 2 | 1 | 2 | 1 | 5 | 4 | 3 | 2 | 1 | 3 | 2 | 1 | |||||||||||||||||
| L3 | 3 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | |||||||||||||||||||||||
| A | B | C | |||||||||||||||||||||||||||||
We now know at each item how many remaining items of the same weight there are. And we know the bounds: we put the A items between nextCELow3 and nextCEHigh3, the B items between nextCELow2 and nextCEHigh2, and the C items nexCELow1 and nextCEHigh1. For any items in C at levels 2 or greater, or any items at B of level 3, the bounds are for those levels are 02..FF.
To aid in the production, we have three weight generators. A weight generator handles the complications of figuring out the optimal weights as in Intermediate CEs below. When we reset a weight generator, we tell it the low and high bounds for the weights, and the number of items. It will then feed out nextWeight values successively.
As we pass through the tokens from start to end, we look at the current token and the one after it (if any). Based on that, we reset the weight generators if we are about to go down in level. We then ask the three weight generators for the weights for each level of the current token.
How do we determine how to allocate a number of weights, given a lower and upper bound?
First, normalize the bounds to 32 bits, left shifted. Then determine the following counts (using an example):
| lowBounds | 36.42.15.00 | |
| firstLow | 36.42.16.00 | lowCount = FF - 16 + 1 |
| lastLow | 36.42.FF.00 | |
| firstMid | 36.43.00.00 | midCount = 59 - 43 + 1 |
| lastMid | 36.59.00.00 | |
| firstHigh | 36.5A.02.00 | highCount = 08 - 06 + 1 |
| lastHigh | 36.5A.06.00 | |
| highBounds | 36.5A.08.00 |
Either low or mid may not exist (e.g. if lowBounds and highBounds are the same length, and have the same non-final bytes. Because of the way we construct the UCA, we are guaranteed that highCount is at least 1.
Note: When incrementing primary values, we will not cross high byte boundaries except where there is only a single-byte primary. That is to ensure that the script reordering will continue to work. If such a boundary falls in the middle of first, mid or last, then truncate it (choosing the longer range).
Set slot = low, mid, or high depending on which has fewer bytes (and exists). If slotCount is large enough for targetCount, then allocate the values there.
Otherwise, loop, adding 1 to slotLen, and multiplying slotCount by 254 (the number of allowable bytes, since 00 and 01 can't be used. Continue until slotCount >= targetCount. Then find out how many values can be done with one fewer bytes:
shorterCount = floor((slotCount - targetCount)/254)
slotCount -= shorterCount.
Once we have this, allocate shorterCount items within slot with each item having a byte length of (slotLen - 1), then the remainder of the items with byte lengths of slotLen. Example: suppose we have 7 items, to squeeze into a gap of 5 with a length of 2. Then we can fit 4 in with 2 bytes, and the remaining 3 items will take 3 bytes.
Note: while it is possible to produce slightly more optimal weightings, we won't spend the time necessary to compute them.
After we have assigned all of the CEs for the source parts of the tokens, we make a second pass through and find the list of CEs for all expansions. We do this in two passes since "a/e" will depend on the final CE value for "e", which may be changed by an element after it.
Since the case bit in CEs is character based, it does not depend on the tailoring rules at all. It is basically a post-processing step. That is, in any tailoring rule that results in "xyz" beings given collation elements C1, C2...Cn, each of the characters x, y, and z are checked by looking them up in the UCA table. If any one of them has a first CE that has the case bit on, then the case bit is turned on in C1..Cn.
The UCA data table only specifies relative ordering among weights. We are free to redistribute the weight values as long as the relative ordering is the same. To make our processing more efficient, decrease the sort-key length for A-Z, allow script-ordering, provide for tailoring of the read-only table, we will preprocess the data so that we:
Add gap of at least 1 between all weights at each level (allows tailoring).
Allow no bytes of 00, 01, 02. See Magic Bytes.
Set the following primaries to have odd single-byte primaries (e.g. 3100, 3300, 3500...) for compression (they are odd to allow gaps for tailoring).
Space
Latin a-z
Start at 04 for all weights. See Magic Bytes.
For secondaries and tertiaries, have a large gap between UNMARKED and other values, to allow for UCA-style run-length compression.
Leave gaps at the start and end of each byte range, to allow for tailoring and future UCA values.
Make sure that all primaries are less than [top]. See Magic Bytes.
Drop all "artificial secondaries" introduced for canonical decomposables, then pack secondaries, starting at UNMARKED. (so we can use fewer bits for secondaries)
Start different scripts on 256 bounds (to let us shuffle scripts). Scripts are determined by the scripts of letters in ScriptNames.txt, except that variables are treated as a separate script.
We generate the case bit, as described below.
As in Tailoring, we form the canonical closure of the table. That is, if a character X has a canonical decomposition Y, then we add the mapping from X to the CEs that correspond to the characters in Y. This allows us to accept all FCD characters without normalization processing (See CheckFCD).
|
Note: We already allow for both normalized and non-normalized collation in ICU 1.6 (and earlier). In building data tables: (a) all rules are normalized to NFD before the characters are added to the table, (b) after all rules have been added, then all decomposables (except Hangul) are added as expansions. Step (b) is equivalent to adding all the following rules (with normalization off):
This will get the correct answer whether or not the source text for any sort key is normalized or not, unless the text contains characters that are not in canonical order. So for the locales that would really be affected by this (Greece, Vietnam), we turn on normalization by default. |
A draft processed UCA table is found in Appendix 3: Data Files. For information on the fractional_ce, see Fractional Collation Elements.
Once we process the UCA data, we write it out in a compact, efficient form using the function described in Flat File.
Here is sample code for computing the case bits. Once this is in the UCA, then that very data is used in tailorings: see Tailoring Case Bit.