Draft 2001-08-06
The following gives a draft proposed list of additional properties that could be used in building a UnicodeSet, which is used in transliteration (filters and rules), rule-based break iteration, and as a general utility. While we may not offer all of these properties initially in ICU, the eventual goal would be to support them all.
There will be two ways of invoking these in building a UnicodeSet. The syntax extends POSIX or Perl syntax with the addition of "=value". In the tables below, only the Perl style is given.
| | Positive | Negative |
|---|---|---|
| POSIX-style syntax: | [:type=value:] | [:^type=value:] |
| Perl-style syntax: | \p{type=value} | \P{type=value} |
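As an illustration of the four forms in the table, the following minimal Java sketch splits a single pattern into a negation flag, a type, and a value. The class `PropertyPattern` and its regular expressions are hypothetical and are not part of ICU. When there is no equals sign, the lone name is stored in `type` and left for a later step to interpret.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an ICU class): splits one property pattern into its parts. */
final class PropertyPattern {
    // [:type=value:] or [:^type=value:]; the "=value" part is optional.
    private static final Pattern POSIX =
            Pattern.compile("\\[:(\\^?)([^:=\\]]+)(?:=([^:\\]]+))?:\\]");
    // \p{type=value} or \P{type=value}; the "=value" part is optional.
    private static final Pattern PERL =
            Pattern.compile("\\\\([pP])\\{([^=}]+)(?:=([^}]+))?\\}");

    final boolean negated;
    final String type;   // the lone name when the pattern has no equals sign
    final String value;  // empty when the pattern has no equals sign

    private PropertyPattern(boolean negated, String type, String value) {
        this.negated = negated;
        this.type = type.trim();
        this.value = value == null ? "" : value.trim();
    }

    /** Parses one pattern in either style, or throws if it matches neither. */
    static PropertyPattern parse(String text) {
        Matcher m = POSIX.matcher(text);
        if (m.matches()) {
            // A leading '^' marks the negative POSIX form.
            return new PropertyPattern(!m.group(1).isEmpty(), m.group(2), m.group(3));
        }
        m = PERL.matcher(text);
        if (m.matches()) {
            // An uppercase 'P' marks the negative Perl form.
            return new PropertyPattern(m.group(1).equals("P"), m.group(2), m.group(3));
        }
        throw new IllegalArgumentException("Not a property pattern: " + text);
    }
}
```

For example, `PropertyPattern.parse("[:^gc=Lu:]")` yields a negated pattern with type `gc` and value `Lu`.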
The tables give up to three variants of each property: Short, Medium, and Long.
If the type or the value is omitted, the equals sign is omitted as well. The short style is used only for GeneralCategory and Script, since these are very common and the omission is unambiguous. The value is omitted only in the case of binary properties, whose values are only TRUE and FALSE (alternates: T, F, Y, N, Yes, No). For brevity, the binary properties are listed without values in the tables. NumericValue and CombiningClass can take numeric values; only one sample is given in the table.
In actual practice, you can mix type names that are omitted, abbreviated, or full with values that are omitted, abbreviated, or full. So for GeneralCategory=Unassigned you could use any of the forms given explicitly in the table, and you could also use \p{gc=Unassigned}, \p{GeneralCategory=Cn}, and \p{Unassigned}. For the binary property WhiteSpace, you could use \p{WhiteSpace}, \p{WhiteSpace=T}, \p{WhiteSpace=TRUE}, or even \P{WhiteSpace=F} (although the last is probably best avoided!).
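To make the equivalence of \p{WhiteSpace} and \P{WhiteSpace=F} concrete, here is a small Java sketch of how a binary value name and the negation flag might combine. The class `BinaryValue` and its methods are hypothetical names. Accepting the aliases in any letter case is an assumption based on the statement that case is ignored.

```java
import java.util.Locale;

/** Hypothetical helper: interprets the value names of a binary property. */
final class BinaryValue {
    /** Maps TRUE/T/Y/Yes and FALSE/F/N/No to a boolean; an omitted value means TRUE. */
    static boolean parse(String valueName) {
        String v = valueName == null ? "" : valueName.trim().toUpperCase(Locale.ROOT);
        switch (v) {
            case "":
            case "T":
            case "TRUE":
            case "Y":
            case "YES":
                return true;
            case "F":
            case "FALSE":
            case "N":
            case "NO":
                return false;
            default:
                throw new IllegalArgumentException("Not a binary value: " + valueName);
        }
    }

    /**
     * Returns true when the pattern selects the code points that HAVE the property.
     * A negated pattern with a FALSE value is a double negative and selects them too.
     */
    static boolean selectsTrue(boolean negated, String valueName) {
        return parse(valueName) != negated;
    }
}
```

Under this sketch, `selectsTrue(false, "")` and `selectsTrue(true, "F")` both return `true`, which is why the two spellings describe the same set.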
Case and whitespace are ignored when these are processed, so you can use them for clarity if desired. E.g. \p{General Category = Uppercase Letter} or \p{general category = uppercase letter}.
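One way to get this behavior is to normalize every name before looking it up. The Java sketch below removes whitespace and lowercases the result; `LooseNames` is a hypothetical helper, not an existing ICU class, and lowercasing with `Locale.ROOT` is an assumption made so that the comparison does not depend on the default locale.

```java
import java.util.Locale;

/** Hypothetical helper: compares property names while ignoring case and whitespace. */
final class LooseNames {
    /** Drops all whitespace characters and lowercases what remains. */
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
            }
        }
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    /** Returns true if the two names are equal after normalization. */
    static boolean sameName(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

With this in place, `LooseNames.sameName("General Category", "generalcategory")` returns `true`, so both spellings would reach the same table entry.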
Note: the GeneralCategory is already supported by UnicodeSet in ICU 1.8, but only in the short form. There are also some special values in the General Category.
For omitted properties, see Omitted Properties. For a possible implementation, see Possible Implementation. For examples of these in transliteration, see Transliteration Rule Tutorial.
| General Category | ||
|---|---|---|
| Short | Medium | Long |
| \p{Any} | \p{gc=Any} | \p{GeneralCategory=Any} |
| \p{L} | \p{gc=L} | \p{GeneralCategory=Letter} |
| \p{M} | \p{gc=M} | \p{GeneralCategory=Mark} |
| \p{N} | \p{gc=N} | \p{GeneralCategory=Number} |
| \p{C} | \p{gc=C} | \p{GeneralCategory=Other} |
| \p{Z} | \p{gc=Z} | \p{GeneralCategory=Separator} |
| \p{P} | \p{gc=P} | \p{GeneralCategory=Punctuation} |
| \p{S} | \p{gc=S} | \p{GeneralCategory=Symbol} |
| \p{Lu} | \p{gc=Lu} | \p{GeneralCategory=UppercaseLetter} |
| \p{Ll} | \p{gc=Ll} | \p{GeneralCategory=LowercaseLetter} |
| \p{Lt} | \p{gc=Lt} | \p{GeneralCategory=TitlecaseLetter} |
| \p{Lm} | \p{gc=Lm} | \p{GeneralCategory=ModifierLetter} |
| \p{Lo} | \p{gc=Lo} | \p{GeneralCategory=OtherLetter} |
| \p{Mn} | \p{gc=Mn} | \p{GeneralCategory=NonspacingMark} |
| \p{Me} | \p{gc=Me} | \p{GeneralCategory=EnclosingMark} |
| \p{Mc} | \p{gc=Mc} | \p{GeneralCategory=SpacingMark} |
| \p{Nd} | \p{gc=Nd} | \p{GeneralCategory=DecimalNumber} |
| \p{Nl} | \p{gc=Nl} | \p{GeneralCategory=LetterNumber} |
| \p{No} | \p{gc=No} | \p{GeneralCategory=OtherNumber} |
| \p{Zs} | \p{gc=Zs} | \p{GeneralCategory=SpaceSeparator} |
| \p{Zl} | \p{gc=Zl} | \p{GeneralCategory=LineSeparator} |
| \p{Zp} | \p{gc=Zp} | \p{GeneralCategory=ParagraphSeparator} |
| \p{Pd} | \p{gc=Pd} | \p{GeneralCategory=DashPunctuation} |
| \p{Ps} | \p{gc=Ps} | \p{GeneralCategory=OpenPunctuation} |
| \p{Pi} | \p{gc=Pi} | \p{GeneralCategory=InitialPunctuation} |
| \p{Pe} | \p{gc=Pe} | \p{GeneralCategory=ClosePunctuation} |
| \p{Pf} | \p{gc=Pf} | \p{GeneralCategory=FinalPunctuation} |
| \p{Pc} | \p{gc=Pc} | \p{GeneralCategory=ConnectorPunctuation} |
| \p{Po} | \p{gc=Po} | \p{GeneralCategory=OtherPunctuation} |
| \p{Sm} | \p{gc=Sm} | \p{GeneralCategory=MathSymbol} |
| \p{Sc} | \p{gc=Sc} | \p{GeneralCategory=CurrencySymbol} |
| \p{Sk} | \p{gc=Sk} | \p{GeneralCategory=ModifierSymbol} |
| \p{So} | \p{gc=So} | \p{GeneralCategory=OtherSymbol} |
| \p{Cc} | \p{gc=Cc} | \p{GeneralCategory=Control} |
| \p{Cf} | \p{gc=Cf} | \p{GeneralCategory=Format} |
| \p{Co} | \p{gc=Co} | \p{GeneralCategory=PrivateUse} |
| \p{Cs} | \p{gc=Cs} | \p{GeneralCategory=Surrogate} |
| \p{Cn} | \p{gc=Cn} | \p{GeneralCategory=Unassigned} |
| Combining Class | |
|---|---|
| Medium | Long |
| \p{cc=99} | \p{CombiningClass=99} |
| \p{cc=nr} | \p{CombiningClass=NotReordered} |
| \p{cc=ov} | \p{CombiningClass=Overlay} |
| \p{cc=nu} | \p{CombiningClass=Nukta} |
| \p{cc=kv} | \p{CombiningClass=KanaVoicing} |
| \p{cc=vi} | \p{CombiningClass=Virama} |
| \p{cc=abl} | \p{CombiningClass=AttachedBelowLeft} |
| \p{cc=aar} | \p{CombiningClass=AttachedAboveRight} |
| \p{cc=bl} | \p{CombiningClass=BelowLeft} |
| \p{cc=b} | \p{CombiningClass=Below} |
| \p{cc=br} | \p{CombiningClass=BelowRight} |
| \p{cc=l} | \p{CombiningClass=Left} |
| \p{cc=r} | \p{CombiningClass=Right} |
| \p{cc=al} | \p{CombiningClass=AboveLeft} |
| \p{cc=a} | \p{CombiningClass=Above} |
| \p{cc=ar} | \p{CombiningClass=AboveRight} |
| \p{cc=db} | \p{CombiningClass=DoubleBelow} |
| \p{cc=da} | \p{CombiningClass=DoubleAbove} |
| \p{cc=is} | \p{CombiningClass=IotaSubscript} |
| Bidi Class | |
|---|---|
| Medium | Long |
| \p{bc=L} | \p{BidiClass=LeftToRight} |
| \p{bc=R} | \p{BidiClass=RightToLeft} |
| \p{bc=AL} | \p{BidiClass=ArabicLetter} |
| \p{bc=EN} | \p{BidiClass=EuropeanNumber} |
| \p{bc=ES} | \p{BidiClass=EuropeanSeparator} |
| \p{bc=ET} | \p{BidiClass=EuropeanTerminator} |
| \p{bc=AN} | \p{BidiClass=ArabicNumber} |
| \p{bc=CS} | \p{BidiClass=CommonSeparator} |
| \p{bc=B} | \p{BidiClass=ParagraphSeparator} |
| \p{bc=S} | \p{BidiClass=SegmentSeparator} |
| \p{bc=WS} | \p{BidiClass=WhiteSpace} |
| \p{bc=ON} | \p{BidiClass=OtherNeutral} |
| \p{bc=BN} | \p{BidiClass=BoundaryNeutral} |
| \p{bc=NSM} | \p{BidiClass=NonspacingMark} |
| \p{bc=LRO} | \p{BidiClass=LeftToRightOverride} |
| \p{bc=RLO} | \p{BidiClass=RightToLeftOverride} |
| \p{bc=LRE} | \p{BidiClass=LeftToRightEmbedding} |
| \p{bc=RLE} | \p{BidiClass=RightToLeftEmbedding} |
| \p{bc=PDF} | \p{BidiClass=PopDirectionalFormat} |
| Decomposition Type | |
|---|---|
| Medium | Long |
| \p{dt=no} | \p{DecompositionType=none} |
| \p{dt=ca} | \p{DecompositionType=canonical} |
| \p{dt=co} | \p{DecompositionType=compat} |
| \p{dt=fo} | \p{DecompositionType=font} |
| \p{dt=nb} | \p{DecompositionType=noBreak} |
| \p{dt=in} | \p{DecompositionType=initial} |
| \p{dt=me} | \p{DecompositionType=medial} |
| \p{dt=fi} | \p{DecompositionType=final} |
| \p{dt=is} | \p{DecompositionType=isolated} |
| \p{dt=ci} | \p{DecompositionType=circle} |
| \p{dt=sp} | \p{DecompositionType=super} |
| \p{dt=sb} | \p{DecompositionType=sub} |
| \p{dt=ve} | \p{DecompositionType=vertical} |
| \p{dt=wi} | \p{DecompositionType=wide} |
| \p{dt=na} | \p{DecompositionType=narrow} |
| \p{dt=sm} | \p{DecompositionType=small} |
| \p{dt=sq} | \p{DecompositionType=square} |
| \p{dt=fr} | \p{DecompositionType=fraction} |
| Numeric Value | |
|---|---|
| Medium | Long |
| \p{nv=1.5} | \p{NumericValue=1.5} |
| Numeric Type | |
|---|---|
| Medium | Long |
| \p{nt=no} | \p{NumericType=none} |
| \p{nt=nu} | \p{NumericType=numeric} |
| \p{nt=di} | \p{NumericType=digit} |
| \p{nt=de} | \p{NumericType=decimal} |
| East Asian Width | |
|---|---|
| Medium | Long |
| \p{ea=N} | \p{EastAsianWidth=Neutral} |
| \p{ea=A} | \p{EastAsianWidth=Ambiguous} |
| \p{ea=H} | \p{EastAsianWidth=Halfwidth} |
| \p{ea=W} | \p{EastAsianWidth=Wide} |
| \p{ea=F} | \p{EastAsianWidth=Fullwidth} |
| \p{ea=Na} | \p{EastAsianWidth=Narrow} |
| Line Break | |
|---|---|
| Medium | Long |
| \p{lb=OP} | \p{LineBreak=OpenPunctuation} |
| \p{lb=CL} | \p{LineBreak=ClosePunctuation} |
| \p{lb=QU} | \p{LineBreak=Quotation} |
| \p{lb=GL} | \p{LineBreak=Glue} |
| \p{lb=NS} | \p{LineBreak=Nonstarter} |
| \p{lb=EX} | \p{LineBreak=Exclamation} |
| \p{lb=SY} | \p{LineBreak=BreakSymbols} |
| \p{lb=IS} | \p{LineBreak=InfixNumeric} |
| \p{lb=PR} | \p{LineBreak=PrefixNumeric} |
| \p{lb=PO} | \p{LineBreak=PostfixNumeric} |
| \p{lb=NU} | \p{LineBreak=Numeric} |
| \p{lb=AL} | \p{LineBreak=Alphabetic} |
| \p{lb=ID} | \p{LineBreak=Ideographic} |
| \p{lb=IN} | \p{LineBreak=Inseparable} |
| \p{lb=HY} | \p{LineBreak=Hyphen} |
| \p{lb=CM} | \p{LineBreak=CombiningMark} |
| \p{lb=BB} | \p{LineBreak=BreakBefore} |
| \p{lb=BA} | \p{LineBreak=BreakAfter} |
| \p{lb=B2} | \p{LineBreak=BreakBeforeAndAfter} |
| \p{lb=SP} | \p{LineBreak=Space} |
| \p{lb=BK} | \p{LineBreak=MandatoryBreak} |
| \p{lb=CR} | \p{LineBreak=CarriageReturn} |
| \p{lb=LF} | \p{LineBreak=LineFeed} |
| \p{lb=CB} | \p{LineBreak=ContingentBreak} |
| \p{lb=SA} | \p{LineBreak=ComplexContext} |
| \p{lb=AI} | \p{LineBreak=Ambiguous} |
| \p{lb=SG} | \p{LineBreak=Surrogate} |
| \p{lb=ZW} | \p{LineBreak=ZWSpace} |
| \p{lb=XX} | \p{LineBreak=Unknown} |
| Joining Type | |
|---|---|
| Medium | Long |
| \p{jt=C} | \p{JoiningType=JoinCausing} |
| \p{jt=D} | \p{JoiningType=DualJoining} |
| \p{jt=R} | \p{JoiningType=RightJoining} |
| \p{jt=U} | \p{JoiningType=NonJoining} |
| \p{jt=L} | \p{JoiningType=LeftJoining} |
| \p{jt=T} | \p{JoiningType=Transparent} |
| Script | ||
|---|---|---|
| Short | Medium | Long |
| \p{Zyyy} | \p{sc=Zyyy} | \p{Script=COMMON} |
| \p{Latn} | \p{sc=Latn} | \p{Script=LATIN} |
| \p{Grek} | \p{sc=Grek} | \p{Script=GREEK} |
| \p{Cyrl} | \p{sc=Cyrl} | \p{Script=CYRILLIC} |
| \p{Armn} | \p{sc=Armn} | \p{Script=ARMENIAN} |
| \p{Hebr} | \p{sc=Hebr} | \p{Script=HEBREW} |
| \p{Arab} | \p{sc=Arab} | \p{Script=ARABIC} |
| \p{Syrc} | \p{sc=Syrc} | \p{Script=SYRIAC} |
| \p{Thaa} | \p{sc=Thaa} | \p{Script=THAANA} |
| \p{Deva} | \p{sc=Deva} | \p{Script=DEVANAGARI} |
| \p{Beng} | \p{sc=Beng} | \p{Script=BENGALI} |
| \p{Guru} | \p{sc=Guru} | \p{Script=GURMUKHI} |
| \p{Gujr} | \p{sc=Gujr} | \p{Script=GUJARATI} |
| \p{Orya} | \p{sc=Orya} | \p{Script=ORIYA} |
| \p{Taml} | \p{sc=Taml} | \p{Script=TAMIL} |
| \p{Telu} | \p{sc=Telu} | \p{Script=TELUGU} |
| \p{Knda} | \p{sc=Knda} | \p{Script=KANNADA} |
| \p{Mlym} | \p{sc=Mlym} | \p{Script=MALAYALAM} |
| \p{Sinh} | \p{sc=Sinh} | \p{Script=SINHALA} |
| \p{Thai} | \p{sc=Thai} | \p{Script=THAI} |
| \p{Laoo} | \p{sc=Laoo} | \p{Script=LAO} |
| \p{Tibt} | \p{sc=Tibt} | \p{Script=TIBETAN} |
| \p{Mymr} | \p{sc=Mymr} | \p{Script=MYANMAR} |
| \p{Geor} | \p{sc=Geor} | \p{Script=GEORGIAN} |
| \p{Hang} | \p{sc=Hang} | \p{Script=HANGUL} |
| \p{Ethi} | \p{sc=Ethi} | \p{Script=ETHIOPIC} |
| \p{Cher} | \p{sc=Cher} | \p{Script=CHEROKEE} |
| \p{Cans} | \p{sc=Cans} | \p{Script=CANADIAN-ABORIGINAL} |
| \p{Ogam} | \p{sc=Ogam} | \p{Script=OGHAM} |
| \p{Runr} | \p{sc=Runr} | \p{Script=RUNIC} |
| \p{Khmr} | \p{sc=Khmr} | \p{Script=KHMER} |
| \p{Mong} | \p{sc=Mong} | \p{Script=MONGOLIAN} |
| \p{Hira} | \p{sc=Hira} | \p{Script=HIRAGANA} |
| \p{Kana} | \p{sc=Kana} | \p{Script=KATAKANA} |
| \p{Bopo} | \p{sc=Bopo} | \p{Script=BOPOMOFO} |
| \p{Hani} | \p{sc=Hani} | \p{Script=HAN} |
| \p{Yiii} | \p{sc=Yiii} | \p{Script=YI} |
| \p{Ital} | \p{sc=Ital} | \p{Script=OLD-ITALIC} |
| \p{Goth} | \p{sc=Goth} | \p{Script=GOTHIC} |
| \p{Dsrt} | \p{sc=Dsrt} | \p{Script=DESERET} |
| \p{Qaai} | \p{sc=Qaai} | \p{Script=INHERITED} |
| Extended Properties (Binary) | ||
|---|---|---|
| Short | Medium | Long |
| \p{BidiM} | \p{BidiM=T} | \p{BidiMirrored=True} |
| \p{CExc} | \p{CExc=T} | \p{CompositionExclusion=True} |
| \p{WhSp} | \p{WhSp=T} | \p{WhiteSpace=True} |
| \p{NBrk} | \p{NBrk=T} | \p{NonBreak=True} |
| \p{BdCon} | \p{BdCon=T} | \p{BidiControl=True} |
| \p{JCon} | \p{JCon=T} | \p{JoinControl=True} |
| \p{Dash} | \p{Dash=T} | \p{Dash=True} |
| \p{Hyph} | \p{Hyph=T} | \p{Hyphen=True} |
| \p{QMark} | \p{QMark=T} | \p{QuotationMark=True} |
| \p{TPunc} | \p{TPunc=T} | \p{TerminalPunctuation=True} |
| \p{OMath} | \p{OMath=T} | \p{OtherMath=True} |
| \p{HexD} | \p{HexD=T} | \p{HexDigit=True} |
| \p{OAlph} | \p{OAlph=T} | \p{OtherAlphabetic=True} |
| \p{Ideo} | \p{Ideo=T} | \p{Ideographic=True} |
| \p{Diac} | \p{Diac=T} | \p{Diacritic=True} |
| \p{Ext} | \p{Ext=T} | \p{Extender=True} |
| \p{OLoc} | \p{OLoc=T} | \p{OtherLowercase=True} |
| \p{OUpc} | \p{OUpc=T} | \p{OtherUppercase=True} |
| \p{NChar} | \p{NChar=T} | \p{NoncharacterCodePoint=True} |
| \p{AHexD} | \p{AHexD=T} | \p{ASCIIHexDigit=True} |
| Derived Core Properties (Binary) | ||
|---|---|---|
| Short | Medium | Long |
| \p{Math} | \p{Math=T} | \p{Math=True} |
| \p{Alph} | \p{Alph=T} | \p{Alphabetic=True} |
| \p{Loc} | \p{Loc=T} | \p{Lowercase=True} |
| \p{Upc} | \p{Upc=T} | \p{Uppercase=True} |
| \p{IDS} | \p{IDS=T} | \p{IDStart=True} |
| \p{IDC} | \p{IDC=T} | \p{IDContinue=True} |
| \p{XIDS} | \p{XIDS=T} | \p{XIDStart=True} |
| \p{XIDC} | \p{XIDC=T} | \p{XIDContinue=True} |
| Derived Normalization Properties (Non Binary) | |
|---|---|
| Medium | Long |
| \p{NFCP=N} | \p{NFCPermitted=NO} |
| \p{NFCP=M} | \p{NFCPermitted=MAYBE} |
| \p{NFCP=Y} | \p{NFCPermitted=YES} |
| \p{NFKCP=N} | \p{NFKCPermitted=NO} |
| \p{NFKCP=M} | \p{NFKCPermitted=MAYBE} |
| \p{NFKCP=Y} | \p{NFKCPermitted=YES} |
| Derived Normalization Properties (Binary) | ||
|---|---|---|
| Short | Medium | Long |
| \p{NFDP} | \p{NFDP=T} | \p{NFDPermitted=True} |
| \p{NFKDP} | \p{NFKDP=T} | \p{NFKDPermitted=True} |
| \p{FNC} | \p{FNC=T} | \p{FNC=True} |
| \p{CompEx} | \p{CompEx=T} | \p{CompEx=True} |
| \p{NFCX} | \p{NFCX=T} | \p{NFCExpands=True} |
| \p{NFKCX} | \p{NFKCX=T} | \p{NFKCExpands=True} |
| \p{NFDX} | \p{NFDX=T} | \p{NFDExpands=True} |
| \p{NFKDX} | \p{NFKDX=T} | \p{NFKDExpands=True} |
Omitted Properties
The properties Block, JamoName, and JoiningGroup are omitted, since they are not generally useful. (The Block property is actually pernicious.)
The String properties are also omitted. If included, they would look something like:
\p{Name='LATIN LETTER A'}
\p{NFC=Å}
\p{SimpleUppercase=Å}
\p{FullUppercase=Å}
\p{SpecialCasing="tr"}
\p{BidiMirror=')'}
Possible Implementation (Java-style API; ICU C++ would be analogous)
Add a function and interface to Character:
public void addToSet(String typeName, String valueName, UnicodeSet set);
The addToSet function will internally iterate over all code points, adding the ones that match the criterion. Since it can work with the internal data structures for the properties, it can do this more efficiently than external code could.
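The iteration itself can be sketched with the standard Java library alone. In the example below, `java.util.BitSet` stands in for UnicodeSet and `Character.getType` stands in for the internal property data, so it shows only the shape of the loop and not the efficiency gain described above. The class name `GeneralCategoryScan` and the simplified signature are assumptions made for illustration.

```java
import java.util.BitSet;

/** Hypothetical sketch: collects code points by general category. */
final class GeneralCategoryScan {
    /**
     * Adds every code point whose general category equals {@code category}
     * (one of the java.lang.Character type constants) to {@code set}.
     */
    static void addToSet(int category, BitSet set) {
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == category) {
                set.set(cp); // the bit index is the code point itself
            }
        }
    }
}
```

Calling `GeneralCategoryScan.addToSet(Character.UPPERCASE_LETTER, new BitSet())` would collect roughly what \p{Lu} describes, as far as the Unicode data of the running Java version goes.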
If the valueName is empty or null, one of two strategies is used. If the typeName is valid, then the value is set to TRUE. Otherwise the typeName is invalid, so the code will try using "gc" as the typeName and the former typeName as the valueName to see whether those are valid. If that fails, it will try "sc". If that also fails, it is an error.
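The fallback order for an empty valueName can be written out as a short Java sketch. The class `NameResolver` and its three small name tables are hypothetical and hold only a handful of sample names; a real table would hold every name listed in this draft. Throwing an exception is also an assumption, since the draft does not say how the error is reported.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the lookup order used when the valueName is omitted. */
final class NameResolver {
    // Tiny sample tables, for illustration only.
    private static final Set<String> BINARY_TYPES =
            new HashSet<>(Arrays.asList("WhiteSpace", "Dash", "Math"));
    private static final Set<String> GC_VALUES =
            new HashSet<>(Arrays.asList("Lu", "Ll", "Cn", "Unassigned"));
    private static final Set<String> SC_VALUES =
            new HashSet<>(Arrays.asList("Latn", "Grek", "Cyrl"));

    /** Returns a two-element array: the resolved typeName and valueName. */
    static String[] resolve(String typeName, String valueName) {
        if (valueName != null && !valueName.isEmpty()) {
            return new String[] {typeName, valueName}; // nothing to infer
        }
        if (BINARY_TYPES.contains(typeName)) {
            return new String[] {typeName, "TRUE"};    // valid type: value defaults to TRUE
        }
        if (GC_VALUES.contains(typeName)) {
            return new String[] {"gc", typeName};      // retry as a GeneralCategory value
        }
        if (SC_VALUES.contains(typeName)) {
            return new String[] {"sc", typeName};      // retry as a Script value
        }
        throw new IllegalArgumentException("Unknown property name: " + typeName);
    }
}
```

So `resolve("Lu", null)` gives the pair `gc`, `Lu`, while `resolve("WhiteSpace", "")` gives `WhiteSpace`, `TRUE`.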
Internally, we will use a simple implementation. All of the typeNames and valueNames will be added to two hashtables that map them to integers. (We might make this one hashtable, and enforce that no string can be both a typeName and a valueName.) A value integer will contain the real value, plus some bits that tell which type it is valid for. Those validity bits are used to check consistency. The type integer will be used to select a property, and the internal iteration will pick up all code points that match the value. For cases where the same valueName has two different values for two different properties, we'll just use a little special-case code to remap the value if necessary for every property other than the first. Where the value is numeric, we'll detect that based on the typeName, and parse for a double.
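A rough Java sketch of the two tables and the validity bits follows. The bit layout (real value in the low 16 bits, one validity bit per type above them), the type ids, and the class name `PropertyNameTable` are all assumptions chosen for illustration. The remapping of value names shared between properties and the numeric case are left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: name tables whose value integers carry validity bits. */
final class PropertyNameTable {
    static final int GC = 0;  // assumed type id for GeneralCategory
    static final int BC = 1;  // assumed type id for BidiClass
    private static final int VALUE_BITS = 16;
    private static final int VALUE_MASK = (1 << VALUE_BITS) - 1;

    private final Map<String, Integer> types = new HashMap<>();
    private final Map<String, Integer> values = new HashMap<>();

    void addType(String typeName, int typeId) {
        types.put(typeName, typeId);
    }

    /** Stores the real value together with a bit marking the type it is valid for. */
    void addValue(String valueName, int typeId, int realValue) {
        int validity = 1 << (VALUE_BITS + typeId);
        values.put(valueName, validity | (realValue & VALUE_MASK));
    }

    /** Returns the real value, or throws if either name is unknown or they do not belong together. */
    int lookup(String typeName, String valueName) {
        Integer typeId = types.get(typeName);
        Integer encoded = values.get(valueName);
        if (typeId == null || encoded == null) {
            throw new IllegalArgumentException("Unknown name: " + typeName + "=" + valueName);
        }
        int validity = 1 << (VALUE_BITS + typeId);
        if ((encoded & validity) == 0) {
            throw new IllegalArgumentException(valueName + " is not a value of " + typeName);
        }
        return encoded & VALUE_MASK;
    }
}
```

After registering `gc` and `bc` as types, `Lu` as a `GC` value, and `EN` as a `BC` value, `lookup("gc", "Lu")` returns the stored value, while `lookup("bc", "Lu")` is rejected by the validity check.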