When collating or matching text, a number of attributes can be used to affect the result. The following describes the attributes, their values, their effects, their normal usage, and their implications for string comparison performance and sort key length. It also includes single-letter abbreviations for both the attributes and their values. These abbreviations allow a 'short-form' specification of a set of collation options, such as "UCA4.0.0_AS_LSV_S2", which can be used to specify that the desired options are: UCA version 4.0.0; ignore spaces, punctuation and symbols; use Swedish linguistic conventions; compare case-insensitively.
A number of attribute values are common across different attributes; these include Default (abbreviated as D), On (O), and Off (X). Unless otherwise stated, the examples use the UCA alone with default settings.
| Attribute | Abb. | Possible Values | Description |
|---|---|---|---|
| Locale | L | <locale> | The Locale attribute is typically the most important attribute for correct sorting and matching, according to user expectations in different countries and regions. The default UCA ordering sorts only a few languages, such as English and Italian, correctly ("correctly" meaning according to the normal expectations of users of those languages). For any other language, a locale needs to be supplied so that a collator correctly tailored for that locale is chosen.<br><br>The choice of a locale automatically presets all of the other attributes to values that are reasonable for that locale, so most of the time they do not need to be set explicitly. In some cases, the choice of locale will make a difference in string comparison performance and/or sort key length. In short attribute names, L<language>_<region>_<variant> is represented by:<br><br>If no language, region, or variant is selected, the collator will use the default UCA ordering. Locale Explorer shows the languages, regions, and variants that ICU supports, and provides a demo of how they differ in terms of sorted output. |
| Strength | S | 1, 2, 3, 4, I, D | The Strength attribute determines whether accents or case are taken into account when collating or matching text. (In writing systems without case or accents, it controls similarly important features.) The default strength setting usually does not need to be changed for collating (sorting), but often needs to be changed when matching (e.g. SELECT). The possible values are Default (D), Primary (1), Secondary (2), Tertiary (3), Quaternary (4), and Identical (I).<br><br>For example, people may choose to ignore accents, or to ignore both accents and case, when searching for text.<br><br>Almost all characters are distinguished by the first three levels, and in most locales the default value is thus Tertiary. However, if Alternate is set to Shifted, then the Quaternary strength (4) can be used to break ties among whitespace, punctuation, and symbols that would otherwise be ignored. If very fine distinctions among characters are required, then the Identical strength (I) can be used; for example, Identical strength distinguishes between Mathematical Bold Small A and Mathematical Italic Small A. (For more examples, look at the cells with white backgrounds in the collation charts.) However, using levels higher than Tertiary, especially the Identical strength, results in significantly longer sort keys and slower string comparison for equal strings. |
| Case_Level | K | X, O, D | The Case_Level attribute is used when ignoring accents but not case. In such a situation, set Strength to Primary and Case_Level to On. In most locales, this setting is Off by default. Setting this attribute to On has a small impact on string comparison performance and sort key length. |
| Case_First | C | X, L, U, D | The Case_First attribute controls whether uppercase letters come before lowercase letters or vice versa, in the absence of other differences in the strings. The possible values are Uppercase_First (U) and Lowercase_First (L), plus the standard Default and Off. There is almost no difference between the Off and Lowercase_First options in terms of results, so users typically choose only between Off and Uppercase_First. (People interested in the detailed differences between X and L should consult Collation Customization.) Specifying either L or U won't affect string comparison performance, but will affect the sort key length. |
| Alternate | A | N, S, D | The Alternate attribute controls the handling of the so-called variable characters in the UCA: whitespace, punctuation, and symbols. If Alternate is set to Non-Ignorable (N), then differences among these characters are of the same importance as differences among letters. If Alternate is set to Shifted (S), then these characters are of only minor importance. The Shifted value is often used in combination with Strength set to Quaternary. In such a case, whitespace, punctuation, and symbols are considered when comparing strings, but only if all other aspects of the strings (base letters, accents, and case) are identical. If Alternate is not set to Shifted, then there is no difference between a Strength of 3 and a Strength of 4.<br><br>For more information and examples, see Variable_Weighting in the UCA. The reason the Alternate values are not simply On and Off is that additional Alternate values may be added in the future. The UCA option Blanked is expressed with Strength set to 3 and Alternate set to Shifted. The default for most locales is Non-Ignorable. If Shifted is selected, comparison may be slower when there are many strings that are the same except for punctuation; sort key length will not be affected unless the strength level is also increased. |
| Variable_Top | T | <string> | The Variable_Top attribute is only meaningful if the Alternate attribute is not set to Non-Ignorable. In such a case, it controls which characters count as ignorable. The string value specifies the character with the "highest" weight (in UCA order) that is to be considered ignorable.<br><br>Thus, for example, if a user wanted whitespace to be ignorable, but not any visible characters, then they would use the value Variable_Top="\u0020" (space). The string should only be a single character. All characters of the same primary weight are equivalent, so Variable_Top="\u3000" (ideographic space) has the same effect as Variable_Top="\u0020". This setting (alone) has little impact on string comparison performance; setting it lower or higher will make sort keys slightly shorter or longer respectively. |
| Normalization | N | X, O, D | The Normalization setting determines whether text is thoroughly normalized or not in comparison. Even if the setting is Off (which is the default for many locales), text as represented in common usage will compare correctly (for details, see UTN #5). Only if the accent marks are in non-canonical order will there be a problem. If the setting is On, then the best results are guaranteed for all possible text input. There is a medium string comparison performance cost if this attribute is On, depending on the frequency of sequences that require normalization. There is no significant effect on sort key length. |
| French | F | X, O, D | In French, strings that differ only in their accents are sorted by comparing the accents from the back of the string. This attribute is automatically set to On for the French locales and a few others. Users normally would not need to set this attribute explicitly. There is a string comparison performance cost when it is set to On, but sort key length is unaffected. |
| Hiragana | H | X, O, D | Compatibility with JIS X 4061 requires the introduction of an additional level to distinguish Hiragana and Katakana characters. If compatibility with that standard is required, then this attribute should be set to On, and the strength set to Quaternary. This will affect sort key length and string comparison performance. |

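
To make the interaction between Strength, Case_Level, and Alternate more concrete, here is a rough sketch of a multi-level sort key. It is a toy model built on Python's standard `unicodedata` module and is not the UCA or ICU implementation: the function names `toy_sort_key` and `is_variable` are hypothetical, "variable" characters are approximated by Unicode general category, the tertiary level is reduced to upper versus lower case, the Identical strength is not modeled, and input is always decomposed to NFD (roughly what Normalization set to On would provide).

```python
import unicodedata


def is_variable(ch):
    """Toy stand-in for UCA variable characters: whitespace, punctuation, symbols."""
    return unicodedata.category(ch)[0] in "ZPS"


def toy_sort_key(text, strength=3, case_level=False, shifted=False):
    """Build a simplified multi-level key (assumption: NOT real UCA weights)."""
    primary, secondary, case, quaternary = [], [], [], []
    after_shifted = False  # True while the preceding base character was shifted
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accents go to the secondary level, attached to their base letter.
            if secondary and not after_shifted:
                secondary[-1] += ch
            continue
        after_shifted = shifted and is_variable(ch)
        if after_shifted:
            # Shifted: the character only matters on the quaternary level.
            quaternary.append(ch)
            continue
        primary.append(ch.lower())
        secondary.append("")
        case.append(ch.isupper())
        quaternary.append("\uffff")  # non-variable characters get a high filler
    levels = [primary]
    if strength >= 2:
        levels.append(secondary)
    if case_level:
        levels.append(case)  # extra case level, usable even at Primary strength
    if strength >= 3:
        levels.append(case)  # toy tertiary level: case only
    if strength >= 4:
        levels.append(quaternary)
    return tuple(tuple(level) for level in levels)


# Primary strength ignores accents and case.
print(toy_sort_key("résumé", strength=1) == toy_sort_key("Resume", strength=1))  # True
# Primary strength plus Case_Level ignores accents but not case.
print(toy_sort_key("résumé", 1, case_level=True) == toy_sort_key("Résumé", 1, case_level=True))  # False
# Shifted hides punctuation at Tertiary; Quaternary brings it back as a tie-breaker.
print(toy_sort_key("black-bird", 3, shifted=True) == toy_sort_key("blackbird", 3, shifted=True))  # True
print(toy_sort_key("black-bird", 4, shifted=True) == toy_sort_key("blackbird", 4, shifted=True))  # False
```

The sketch also illustrates why higher strengths cost more: each additional level appends another tuple to the key, which is the toy equivalent of a longer sort key.
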
| Value | Abb. |
|---|---|
| Default | D |
| On | O |
| Off | X |
| Primary | 1 |
| Secondary | 2 |
| Tertiary | 3 |
| Quaternary | 4 |
| Identical | I |
| Shifted | S |
| Non-Ignorable | N |
| Lower-First | L |
| Upper-First | U |
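
As an illustration of how the abbreviations combine, the following sketch decodes a short-form specification into attribute and value names using only the two tables in this section. It is not an ICU API: `parse_short_form` and the lookup tables are hypothetical names, the allowed values per attribute are copied from the "Possible Values" column, and the Locale value is assumed to be a bare language code (region and variant are not handled).

```python
# Attribute and value abbreviations, copied from the tables in this section.
ATTRIBUTES = {
    "L": "Locale", "S": "Strength", "K": "Case_Level", "C": "Case_First",
    "A": "Alternate", "T": "Variable_Top", "N": "Normalization",
    "F": "French", "H": "Hiragana",
}
VALUES = {
    "D": "Default", "O": "On", "X": "Off", "1": "Primary", "2": "Secondary",
    "3": "Tertiary", "4": "Quaternary", "I": "Identical", "S": "Shifted",
    "N": "Non-Ignorable", "L": "Lower-First", "U": "Upper-First",
}
# Single-letter values each attribute accepts; Locale and Variable_Top are free-form.
ALLOWED = {
    "S": "1234ID", "K": "XOD", "C": "XLUD", "A": "NSD",
    "N": "XOD", "F": "XOD", "H": "XOD",
}


def parse_short_form(spec):
    """Decode e.g. 'UCA4.0.0_AS_LSV_S2' into a dict (hypothetical helper)."""
    options = {}
    for token in spec.split("_"):
        if token.startswith("UCA"):
            options["UCA_Version"] = token[3:]
            continue
        if len(token) < 2:
            raise ValueError(f"missing value in token {token!r}")
        attr, value = token[0], token[1:]
        if attr not in ATTRIBUTES:
            raise ValueError(f"unknown attribute {attr!r}")
        if attr in ALLOWED:
            if len(value) != 1 or value not in ALLOWED[attr]:
                raise ValueError(f"invalid value {value!r} for {ATTRIBUTES[attr]}")
            options[ATTRIBUTES[attr]] = VALUES[value]
        else:
            options[ATTRIBUTES[attr]] = value  # kept verbatim (locale or string)
    return options


print(parse_short_form("UCA4.0.0_AS_LSV_S2"))
```

Because the same letter can name both an attribute and a value (S is Strength as an attribute but Shifted as a value), the sketch always reads the first character of a token as the attribute and the remainder as its value.
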
In many database products, fields are padded with null. To get correct results, the input to a Collator should omit any superfluous trailing padding spaces. The problem arises with contractions, expansions, or normalization. Suppose that there are two fields, one containing "aed" and the other containing "äd". A traditional German sort will compare "ä" as if it were "ae" (on a primary level), so the order will be "äd" < "aed". But if both fields are padded with spaces to a length of 4, then this will reverse the order, since the expanded "äd" field will compare as if it were one character longer than the "aed" field. In other words, when you start with strings 1 and 2
| 1. | a | e | d | <space> |
| 2. | ä | d | <space> | <space> |
they end up being compared on a primary level as if they were 1' and 2'
| 1'. | a | e | d | <space> | |
| 2'. | a | e | d | <space> | <space> |
Since 2' has an extra character (the extra space), it counts as having a primary difference when it shouldn't. The correct result occurs when the trailing padding spaces are removed, as in 1" and 2"
| 1". | a | e | d |
| 2". | a | e | d |
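
The effect can be reproduced with a small sketch. The code below does not use a real collator: `toy_primary_key` is a hypothetical helper that mimics only the primary-level expansion of "ä" to "ae" described above, and the field width of 4 matches rows 1 and 2. It shows the spurious primary difference introduced by padding and how trimming trailing spaces before comparison removes it.

```python
# Assumption: a toy stand-in for a traditional German tailoring, primary level only.
EXPANSIONS = {"\u00e4": "ae"}  # "ä" compares as "ae"


def toy_primary_key(text):
    """Expand characters the way the example describes (hypothetical helper)."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text.lower())


# Fixed-width database fields, padded with trailing spaces to four characters.
field1 = "aed".ljust(4)
field2 = "\u00e4d".ljust(4)

# Padded: "aed " versus "aed  ", so the second key is one space longer.
padded_equal = toy_primary_key(field1) == toy_primary_key(field2)

# Trimmed: both keys are "aed", so no primary difference remains.
trimmed_equal = (
    toy_primary_key(field1.rstrip(" ")) == toy_primary_key(field2.rstrip(" "))
)

print(padded_equal, trimmed_equal)  # False True
```

With a real collator the same precaution applies: strip the padding from each field value before passing it to the comparison or sort key function.
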