Mark Davis, 2003-06-05 (later)
Note: all offsets are char offsets, except in findOffsetFromCodePoint
- Character
- CharSequence
- StringBuffer
- CharacterIterator
| Field Summary | |
| static int | CODEPOINT_MIN_VALUE: The lowest Unicode code point value (0). |
| static int | CODE_POINT_MAX_VALUE: The highest Unicode code point value according to the Unicode Standard (0x10FFFF). |
| static int | SUPPLEMENTARY_MIN_VALUE: The minimum value for supplementary code points (0x10000). |
| static int | SURROGATE_MIN_VALUE: Surrogate minimum value (0xD800). |
| static int | LEAD_SURROGATE_MIN_VALUE: Lead surrogate minimum value (0xD800). |
| static int | LEAD_SURROGATE_MAX_VALUE: Lead surrogate maximum value (0xDBFF). |
| static int | TRAIL_SURROGATE_MIN_VALUE: Trail surrogate minimum value (0xDC00). |
| static int | TRAIL_SURROGATE_MAX_VALUE: Trail surrogate maximum value (0xDFFF). |
| static int | SURROGATE_MAX_VALUE: Maximum surrogate value (0xDFFF). |
| Nested Class Summary | |
| static class | StringCPComparator: UTF-16 string comparator class. Its compare method differs from the String compareTo method in that the comparison is by code point, not code unit. Thus "\uD800\uDC00" > "\uFFFF". |
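The difference between code point order and code unit order can be illustrated with a minimal sketch. The class name `CodePointOrderSketch` is hypothetical, and the code relies on `String.codePointAt` and `Character.charCount`, which appeared in the JDK after this proposal was written; it is not the proposed `StringCPComparator` itself.

```java
// Hypothetical sketch of a comparator that orders strings by code point.
final class CodePointOrderSketch implements java.util.Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int i = 0;
        // Both strings are identical up to i, so one index serves for both.
        while (i < a.length() && i < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        String supplementary = "\uD800\uDC00"; // U+10000 as a surrogate pair
        String bmp = "\uFFFF";
        System.out.println(new CodePointOrderSketch().compare(supplementary, bmp));
        System.out.println(supplementary.compareTo(bmp));
    }
}
```

With these inputs the sketch ranks the supplementary character above U+FFFF, while `String.compareTo` ranks it below, because the lead surrogate 0xD800 is numerically smaller than 0xFFFF.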
| Method Summary | |
| static int | getCodePoint(char lead, char trail): Returns the code point corresponding to the two UTF-16 characters. |
| static int | getCharCount(int char32): Determines how many chars this char32 requires. |
| static char | getLeadSurrogate(int char32): Returns the lead surrogate. |
| static char | getTrailSurrogate(int char32): Returns the trail surrogate. |
| static boolean | isSurrogate(char char16): Determines whether the code value is a surrogate. |
| static boolean | isLeadSurrogate(char char16): Determines whether the character is a lead surrogate. |
| static boolean | isTrailSurrogate(char char16): Determines whether the character is a trail surrogate. |
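The arithmetic behind these methods follows directly from the constants above. The following is a minimal sketch, not the proposed implementation: the holder class `Utf16Sketch` is hypothetical, and the sketch assumes well-formed input (a valid lead/trail pair for `getCodePoint`, a supplementary code point for the two surrogate getters).

```java
// Hypothetical holder class; the proposal would place these on Character.
final class Utf16Sketch {
    static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;
    static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    static final int LEAD_SURROGATE_MAX_VALUE = 0xDBFF;
    static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    static final int TRAIL_SURROGATE_MAX_VALUE = 0xDFFF;

    static boolean isLeadSurrogate(char char16) {
        return char16 >= LEAD_SURROGATE_MIN_VALUE && char16 <= LEAD_SURROGATE_MAX_VALUE;
    }

    static boolean isTrailSurrogate(char char16) {
        return char16 >= TRAIL_SURROGATE_MIN_VALUE && char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    static boolean isSurrogate(char char16) {
        return isLeadSurrogate(char16) || isTrailSurrogate(char16);
    }

    // Assumes lead and trail form a valid pair; no validation in this sketch.
    static int getCodePoint(char lead, char trail) {
        return ((lead - LEAD_SURROGATE_MIN_VALUE) << 10)
                + (trail - TRAIL_SURROGATE_MIN_VALUE)
                + SUPPLEMENTARY_MIN_VALUE;
    }

    static int getCharCount(int char32) {
        return char32 >= SUPPLEMENTARY_MIN_VALUE ? 2 : 1;
    }

    // Assumes char32 is a supplementary code point.
    static char getLeadSurrogate(int char32) {
        return (char) (LEAD_SURROGATE_MIN_VALUE + ((char32 - SUPPLEMENTARY_MIN_VALUE) >> 10));
    }

    // Assumes char32 is a supplementary code point.
    static char getTrailSurrogate(int char32) {
        return (char) (TRAIL_SURROGATE_MIN_VALUE + ((char32 - SUPPLEMENTARY_MIN_VALUE) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = getCodePoint('\uD83D', '\uDE00');
        System.out.println(Integer.toHexString(cp));
        System.out.println(getCharCount(cp));
    }
}
```

Each surrogate carries 10 bits of the value `char32 - 0x10000`, which is why the split uses a shift by 10 and a mask of 0x3FF.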
Other property methods on Character need overloads for int, as in ICU4J's UCharacter.
The following functions should go on subclasses of CharSequence (char32At should go on CharSequence itself, if possible).
| Field Summary | |
| static int | SINGLE_CHAR_BOUNDARY: Value returned by bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset, offset+1). That is, offset is at the start of a BMP character (single char). |
| static int | LEAD_SURROGATE_BOUNDARY: Value returned by bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset, offset+2). That is, offset is at the start of a surrogate pair. |
| static int | TRAIL_SURROGATE_BOUNDARY: Value returned by bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset-1, offset+1). That is, offset is in the middle of a surrogate pair. |
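How the three boundary values relate to the substring bounds can be sketched as follows. This is an illustration only: the class `Bounds32Sketch`, the static-method shape, and the numeric constant values are assumptions, and unpaired surrogates are treated as single chars, which the proposal does not spell out.

```java
// Hypothetical sketch; constant values are arbitrary placeholders.
final class Bounds32Sketch {
    static final int SINGLE_CHAR_BOUNDARY = 1;
    static final int LEAD_SURROGATE_BOUNDARY = 2;
    static final int TRAIL_SURROGATE_BOUNDARY = 5;

    // Classifies the position of charAt(offset) within its code point.
    static int bounds32(CharSequence s, int offset) {
        char c = s.charAt(offset);
        if (Character.isHighSurrogate(c)) {
            if (offset + 1 < s.length() && Character.isLowSurrogate(s.charAt(offset + 1))) {
                return LEAD_SURROGATE_BOUNDARY;
            }
        } else if (Character.isLowSurrogate(c)) {
            if (offset > 0 && Character.isHighSurrogate(s.charAt(offset - 1))) {
                return TRAIL_SURROGATE_BOUNDARY;
            }
        }
        // BMP character, or an unpaired surrogate (assumed to stand alone).
        return SINGLE_CHAR_BOUNDARY;
    }

    // Returns the code point containing charAt(offset).
    static int char32At(CharSequence s, int offset) {
        switch (bounds32(s, offset)) {
            case LEAD_SURROGATE_BOUNDARY:
                return Character.toCodePoint(s.charAt(offset), s.charAt(offset + 1));
            case TRAIL_SURROGATE_BOUNDARY:
                return Character.toCodePoint(s.charAt(offset - 1), s.charAt(offset));
            default:
                return s.charAt(offset);
        }
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        for (int i = 0; i < s.length(); i++) {
            System.out.println(i + ": bounds=" + bounds32(s, i)
                    + " char32=" + Integer.toHexString(char32At(s, i)));
        }
    }
}
```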
| Method Summary | |
| int | bounds32(int offset): Returns the type of the boundaries around the char at offset. |
| int | char32At(int offset): Extracts a single UTF-32 value from a string. If offset is in the middle of a code point, returns that code point. |
| int | length32(): Number of code points in a UTF-16 string. Equivalent to findCodePointOffset(length()). |
| int | findCodePointOffset(int offset): Returns the UTF-32 offset corresponding to the first UTF-32 boundary at or after the given UTF-16 offset. Inverse of findOffsetFromCodePoint. |
| int | findOffsetFromCodePoint(int offset32): Returns the UTF-16 offset that corresponds to a UTF-32 offset. Inverse of findCodePointOffset. |
| int | getCharCount(int offset): Determines how many chars the code point at offset requires. Returns 2 if offset is at the start or in the middle of a surrogate pair, otherwise 1. |
| java.lang.String | replace(int oldChar32, int newChar32): Returns a new UTF-16 format Unicode string resulting from replacing all occurrences of oldChar32 in source with newChar32. |
| java.lang.String | replace(java.lang.String oldStr, java.lang.String newStr): Returns a new UTF-16 format Unicode string resulting from replacing all occurrences of oldStr in source with newStr. |
| Utilities for Speed | |
| boolean | hasMoreCodePointsThan(int number): Checks whether the string contains more Unicode code points than a certain number. Equivalent to, but much faster than: str.length32() > number |
| int | moveCodePointOffset(java.lang.String source, int offset, int shift32): Shifts offset by the argument number of code points. Equivalent to, but much faster than: str.findOffsetFromCodePoint(str.findCodePointOffset(offset) + shift32) |
| java.lang.String | valueOf32(int offset): Convenience method corresponding to, but much faster than, String.valueOf(myString.charAt(offset)). Internally uses .substring(x,y), where x and y are determined by calling bounds32(). |
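The offset conversions and the speed shortcut in hasMoreCodePointsThan can be sketched as below. This is a rough illustration under stated assumptions: the class `OffsetSketch` and the static-method shape are hypothetical, the loops are deliberately simple rather than optimized, and unpaired surrogates count as one code point each.

```java
// Hypothetical sketch of the offset conversions described above.
final class OffsetSketch {
    // True when offset falls between the two halves of a surrogate pair.
    private static boolean splitsPair(CharSequence s, int offset) {
        return offset > 0 && offset < s.length()
                && Character.isHighSurrogate(s.charAt(offset - 1))
                && Character.isLowSurrogate(s.charAt(offset));
    }

    // UTF-16 offset to code point offset: counts the code points that
    // start before the given UTF-16 offset.
    static int findCodePointOffset(CharSequence s, int offset) {
        int count = 0;
        for (int i = 0; i < offset; i++) {
            if (!splitsPair(s, i)) {
                count++;
            }
        }
        return count;
    }

    static int length32(CharSequence s) {
        return findCodePointOffset(s, s.length());
    }

    // Code point offset to UTF-16 offset.
    static int findOffsetFromCodePoint(CharSequence s, int offset32) {
        if (offset32 < 0) {
            throw new IndexOutOfBoundsException("offset32: " + offset32);
        }
        int i = 0;
        for (int n = 0; n < offset32; n++) {
            if (i >= s.length()) {
                throw new IndexOutOfBoundsException("offset32: " + offset32);
            }
            i += Character.charCount(Character.codePointAt(s, i));
        }
        return i;
    }

    // Avoids a full count when the char length alone decides the answer.
    static boolean hasMoreCodePointsThan(CharSequence s, int number) {
        if (number < 0) {
            return true;
        }
        int len = s.length();
        if ((len + 1) / 2 > number) {
            return true;  // every code point uses at most 2 chars
        }
        if (len <= number) {
            return false; // every code point uses at least 1 char
        }
        int count = 0;
        for (int i = 0; i < len; ) {
            if (++count > number) {
                return true;
            }
            i += Character.charCount(Character.codePointAt(s, i));
        }
        return false;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        System.out.println(length32(s));
        System.out.println(findOffsetFromCodePoint(s, 2));
    }
}
```

The speedup in hasMoreCodePointsThan comes from the two bounds on the code point count, ceil(length/2) and length, which settle most calls without scanning the text.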
Note: Some existing methods need to be modified to work with supplementaries, such as:
- reverse(): must not reverse the two halves of a surrogate pair.
- indexOf(), lastIndexOf(): interpret the int argument as a code point.
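The reverse() requirement can be illustrated with a short sketch that walks the text backwards one code point at a time. The class and method names are hypothetical, and the code uses `String.codePointBefore` and `StringBuilder.appendCodePoint`, which were added to the JDK after this note was written.

```java
// Hypothetical sketch: reverse a string without splitting surrogate pairs.
final class ReverseSketch {
    static String reverse32(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = s.length(); i > 0; ) {
            // codePointBefore returns the whole pair when one ends at i,
            // and a lone char (including an unpaired surrogate) otherwise.
            int cp = s.codePointBefore(i);
            out.appendCodePoint(cp);
            i -= Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        System.out.println(reverse32(s).equals("b\uD83D\uDE00a"));
    }
}
```

A naive char-by-char reversal would emit the trail surrogate before the lead surrogate, producing two unpaired surrogates instead of the original supplementary character.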
| Method Summary | |
| java.lang.StringBuffer | append32(int char32): Appends a single UTF-32 value to the end of a StringBuffer. Equivalent to, but faster than: append(String.value32Of(char32)) |
| java.lang.StringBuffer | delete32(int offset): Removes the code point at the specified position in this target (shortening target by 1 character if the code point is non-supplementary, 2 otherwise). Equivalent to (but faster than): int bounds = bounds32(offset); |
| java.lang.StringBuffer | insert32(int offset, int char32): Inserts the char32 code point into target at (or before) the argument offset. Equivalent to, but faster than: int bounds = bounds32(offset); insert(String.value32Of(char32), start); |
| void | setChar32At(int offset, int char32): Sets a code point into a UTF-16 position. Equivalent to, but faster than: delete(offset); |
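One way these StringBuffer operations might behave is sketched below as static helpers. Everything here is an assumption for illustration: the class `Buffer32Sketch`, the helper `start32`, and the choice to snap an offset in the middle of a pair back to the start of that pair.

```java
// Hypothetical sketch of code-point-aware StringBuffer edits.
final class Buffer32Sketch {
    // Start of the code point containing charAt(offset).
    private static int start32(CharSequence s, int offset) {
        boolean inPair = offset > 0
                && Character.isLowSurrogate(s.charAt(offset))
                && Character.isHighSurrogate(s.charAt(offset - 1));
        return inPair ? offset - 1 : offset;
    }

    static StringBuffer append32(StringBuffer target, int char32) {
        return target.append(Character.toChars(char32));
    }

    // Removes the whole code point containing charAt(offset).
    static StringBuffer delete32(StringBuffer target, int offset) {
        int start = start32(target, offset);
        int cp = Character.codePointAt(target, start);
        return target.delete(start, start + Character.charCount(cp));
    }

    // Inserts at offset, or just before it if offset would split a pair.
    static StringBuffer insert32(StringBuffer target, int offset, int char32) {
        int start = offset < target.length() ? start32(target, offset) : offset;
        return target.insert(start, Character.toChars(char32));
    }

    // Replaces the whole code point containing charAt(offset).
    static void setChar32At(StringBuffer target, int offset, int char32) {
        int start = start32(target, offset);
        int cp = Character.codePointAt(target, start);
        target.replace(start, start + Character.charCount(cp),
                new String(Character.toChars(char32)));
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("a");
        append32(sb, 0x1F600);
        insert32(sb, 2, 'x');
        System.out.println(sb.length());
    }
}
```

Note that setChar32At can change the length of the buffer when a BMP character is replaced by a supplementary one or vice versa, so char offsets after the edit may shift.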
We made a mistake in our original design of CharacterIterator:
This doesn't extend well to cases where the next boundary is not simply adding an increment, since the function becomes less efficient.
There are some possibilities:
1. In our ICU4J version: UCharacterIterator, we simply produced a different iterator that used better semantics.
2. Alternatively, we could modify CharacterIterator to add methods that deal with code points and address the next() issue by adding alternative methods (see below). We did that in our C++ version.
3. Another possibility is to have an abstract base class that implements CharacterIterator, but has default implementation for everything but move() and current(). (All other functions can be defined reasonably in terms of those; subclasses can override for further efficiency.)
4. Factoring into this is how much CharacterIterator is actually used nowadays, and how much its use has been supplanted by CharSequence. It could be deprecated in favor of CharSequence, making a few modifications then to the following:
| int | setToStart() | Sets the iterator to refer to the first code unit or code point in its iteration range. This can be used to begin a forward iteration with nextPostInc() or next32PostInc(). Returns: the start position of the iteration range. |
| int | firstPostInc() | Sets the iterator to refer to the first code unit in its iteration range, returns that code unit, and moves the position to the second code unit. This is an alternative to setToStart() for forward iteration with nextPostInc(). Returns: the first code unit in its iteration range. |
| int | first32PostInc() | Sets the iterator to refer to the first code point in its iteration range, returns that code point, and moves the position to the second code point. This is an alternative to setToStart() for forward iteration with next32PostInc(). Returns: the first code point in its iteration range. |
| int | next32PostInc() | Gets the current code point for returning and advances to the next code point in the iteration range (toward endIndex()). If there are no more code points to return, returns DONE. Returns: the current code point. |
| int | first32() | Sets the iterator to refer to the first code point in its iteration range, and returns that code point. This can be used to begin an iteration with next32(). Note that an iteration with next32PostInc(), beginning with, e.g., setToStart() or firstPostInc(), is more efficient. Returns: the first code point in its iteration range. |
| int | next32() | Advances to the next code point in the iteration range (toward endIndex()), and returns that code point. If there are no more code points to return, returns DONE. Note that iteration with "pre-increment" semantics is less efficient than iteration with the "post-increment" semantics provided by next32PostInc(). Returns: the next code point. |
| int | setToEnd() | Sets the iterator to the end of its iteration range, just behind the last code unit or code point. This can be used to begin a backward iteration with previous() or previous32(). Returns: the end position of the iteration range. |
| int | last32() | Sets the iterator to refer to the last code point in its iteration range, and returns that code point. This can be used to begin an iteration with previous32(). Returns: the last code point. |
| int | previous32() | Advances to the previous code point in the iteration range (toward startIndex()), and returns that code point. If there are no more code points to return, returns DONE. Returns: the previous code point. |
| int | setIndex32(int position)=0 | Sets the iterator to refer to the beginning of the code point that contains the "position"-th code unit in the text-storage object the iterator refers to, and returns that code point. |
| int | current32(void) const=0 | Returns the code point the iterator currently refers to. |
| Utilities for Speed | | |
| int | move(int delta, EOrigin origin)=0 | Moves the current position relative to the start or end of the iteration range, or relative to the current position itself. |
| int | move32(int delta, EOrigin origin)=0 | Moves the current position relative to the start or end of the iteration range, or relative to the current position itself. |
With these additions, you can adapt old loops straightforwardly, just by using the versions with "32". Or you can use a more efficient model for forward iteration, by using the "32PostInc" versions.
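The post-increment model can be sketched with a small stand-alone iterator. This is an illustration under assumptions, not the proposed CharacterIterator changes: the class `CodePointIteratorSketch` is hypothetical, it covers only a few of the methods listed above, and DONE is assumed to be -1 so that it cannot collide with a real code point.

```java
// Hypothetical sketch of code point iteration with post-increment semantics.
final class CodePointIteratorSketch {
    static final int DONE = -1; // assumed sentinel, outside the code point range

    private final CharSequence text;
    private int pos;

    CodePointIteratorSketch(CharSequence text) {
        this.text = text;
    }

    int setToStart() {
        pos = 0;
        return pos;
    }

    int setToEnd() {
        pos = text.length();
        return pos;
    }

    // Returns the code point at the current position, then advances past it.
    int next32PostInc() {
        if (pos >= text.length()) {
            return DONE;
        }
        int cp = Character.codePointAt(text, pos);
        pos += Character.charCount(cp);
        return cp;
    }

    // Moves back one code point and returns it.
    int previous32() {
        if (pos <= 0) {
            return DONE;
        }
        int cp = Character.codePointBefore(text, pos);
        pos -= Character.charCount(cp);
        return cp;
    }

    public static void main(String[] args) {
        CodePointIteratorSketch it = new CodePointIteratorSketch("a\uD83D\uDE00b");
        it.setToStart();
        for (int cp = it.next32PostInc(); cp != DONE; cp = it.next32PostInc()) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

With post-increment semantics, each step reads the code point once and uses its length to advance, whereas a pre-increment next32() must first skip the current code point and then decode the following one.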
- StringBuffer.getCharCount(int offset) and Character.getCharCount(int char32) should probably have different names. While the target functionality is the same, there could be confusion since the arguments are both ints: one is an offset and one is a code point.
- If char32At() cannot go on CharSequence, it might be worth having a CodePointSequence that extends CharSequence and adds a similar (e.g. small) number of methods such as char32At(). String and StringBuffer could then implement CodePointSequence.
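The CodePointSequence idea might look roughly like the sketch below. The method set (`char32At`, `length32`) is only a guess at what "a small number of methods" would include, and since String cannot be retrofitted in user code, the hypothetical wrapper `SimpleCodePointSequence` stands in for it.

```java
// Hypothetical interface; the method set is an assumption for illustration.
interface CodePointSequence extends CharSequence {
    int char32At(int offset);

    int length32();
}

// Hypothetical wrapper standing in for String implementing the interface.
final class SimpleCodePointSequence implements CodePointSequence {
    private final String s;

    SimpleCodePointSequence(String s) {
        this.s = s;
    }

    @Override
    public int length() {
        return s.length();
    }

    @Override
    public char charAt(int index) {
        return s.charAt(index);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new SimpleCodePointSequence(s.substring(start, end));
    }

    @Override
    public String toString() {
        return s;
    }

    // Returns the code point containing charAt(offset), backing up one char
    // when offset is in the middle of a surrogate pair.
    @Override
    public int char32At(int offset) {
        if (offset > 0 && Character.isLowSurrogate(s.charAt(offset))
                && Character.isHighSurrogate(s.charAt(offset - 1))) {
            offset--;
        }
        return s.codePointAt(offset);
    }

    @Override
    public int length32() {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        CodePointSequence seq = new SimpleCodePointSequence("a\uD83D\uDE00b");
        System.out.println(seq.length() + " chars, " + seq.length32() + " code points");
    }
}
```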