Proposed Supplementary API (JSR 204)

Mark Davis, 2003-06-05 (later)

Note: all offsets are char offsets, except in findOffsetFromCodePoint

Contents

Character
CharSequence
StringBuffer
CharacterIterator

Character

Field Summary
static int CODE_POINT_MIN_VALUE
          The lowest Unicode code point value. (0)
static int CODE_POINT_MAX_VALUE
          The highest Unicode code point value according to the Unicode Standard. (0x10FFFF)
static int SUPPLEMENTARY_MIN_VALUE
          The minimum value for Supplementary code points (0x10000)
   
static int SURROGATE_MIN_VALUE
          Surrogate minimum value (0xD800)
static int LEAD_SURROGATE_MIN_VALUE
          Lead surrogate minimum value (0xD800)
static int LEAD_SURROGATE_MAX_VALUE
          Lead surrogate maximum value (0xDBFF)
static int TRAIL_SURROGATE_MIN_VALUE
          Trail surrogate minimum value (0xDC00)
static int TRAIL_SURROGATE_MAX_VALUE
          Trail surrogate maximum value (0xDFFF)
static int SURROGATE_MAX_VALUE
          Maximum surrogate value (0xDFFF)

Nested Class Summary
static class StringCPComparator
          UTF16 string comparator class. Its compare method differs from the String compareTo method in that the comparison is by code point, not code unit. Thus "\uD800\uDC00" (U+10000) > "\uFFFF".
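
A minimal sketch of such a comparator, assuming well-formed UTF-16 input. The class name CodePointOrderComparator and the fixup helper are illustrative, not part of the proposal. At the first differing code unit, surrogates are remapped above U+E000..U+FFFF so that supplementary characters sort after all BMP characters.

```java
import java.util.Comparator;

// Hypothetical stand-in for the proposed StringCPComparator.
final class CodePointOrderComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != cb) {
                // Compare the remapped code units, not the raw ones.
                return fixup(ca) - fixup(cb);
            }
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return a.length() - b.length();
    }

    // Rotates the top of the code unit range so that surrogates
    // (D800..DFFF) sort above E000..FFFF.
    private static int fixup(char c) {
        if (c < 0xD800) return c;
        if (c <= 0xDFFF) return c + 0x2000; // surrogates -> F800..FFFF
        return c - 0x800;                   // E000..FFFF -> D800..F7FF
    }
}
```

With this sketch, compare("\uD800\uDC00", "\uFFFF") is positive, whereas "\uD800\uDC00".compareTo("\uFFFF") is negative.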

Method Summary
static int getCodePoint(char lead, char trail)
          Returns a code point corresponding to the two UTF16 characters.
static int getCharCount(int char32)
          Determines how many chars this char32 requires.
static char getLeadSurrogate(int char32)
          Returns the lead surrogate.
static char getTrailSurrogate(int char32)
          Returns the trail surrogate.
static boolean isSurrogate(char char16)
          Determines whether the code value is a surrogate.
static boolean isLeadSurrogate(char char16)
          Determines whether the character is a lead surrogate.
static boolean isTrailSurrogate(char char16)
          Determines whether the character is a trail surrogate.

Other property methods on Character need overloads for int, as in ICU4J's UCharacter.
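
The arithmetic behind the methods above is small enough to sketch. The class name SurrogateMath is illustrative; the method names follow the summary, and the bodies are one plausible implementation under the standard UTF-16 definition, not the proposal's own code.

```java
// Hypothetical helper class; method names mirror the proposed Character methods.
final class SurrogateMath {
    static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;
    static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;

    // Combines a lead and a trail surrogate into one supplementary code point.
    static int getCodePoint(char lead, char trail) {
        return ((lead - LEAD_SURROGATE_MIN_VALUE) << 10)
                + (trail - TRAIL_SURROGATE_MIN_VALUE)
                + SUPPLEMENTARY_MIN_VALUE;
    }

    // 1 for BMP code points, 2 for supplementary code points.
    static int getCharCount(int char32) {
        return char32 < SUPPLEMENTARY_MIN_VALUE ? 1 : 2;
    }

    // Only meaningful when char32 is a supplementary code point.
    static char getLeadSurrogate(int char32) {
        return (char) (LEAD_SURROGATE_MIN_VALUE
                + ((char32 - SUPPLEMENTARY_MIN_VALUE) >> 10));
    }

    // Only meaningful when char32 is a supplementary code point.
    static char getTrailSurrogate(int char32) {
        return (char) (TRAIL_SURROGATE_MIN_VALUE
                + ((char32 - SUPPLEMENTARY_MIN_VALUE) & 0x3FF));
    }

    static boolean isSurrogate(char c)      { return c >= 0xD800 && c <= 0xDFFF; }
    static boolean isLeadSurrogate(char c)  { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isTrailSurrogate(char c) { return c >= 0xDC00 && c <= 0xDFFF; }
}
```

For example, U+1D11E splits into the pair D834 DD1E, and getCodePoint applied to that pair gives back 0x1D11E.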

CharSequence (and subclasses)

The following functions should go on subclasses of CharSequence (char32At should go on CharSequence itself, if possible).

Field Summary
static int SINGLE_CHAR_BOUNDARY
          Value returned in bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset, offset+1).  That is, offset is at the start of a BMP character (single char).
static int LEAD_SURROGATE_BOUNDARY
          Value returned in bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset, offset+2). That is, offset is at the start of a surrogate pair.
static int TRAIL_SURROGATE_BOUNDARY
          Value returned in bounds32(offset). The substring bounds for the code point that contains charAt(offset) are (offset-1, offset+1).  That is, offset is in the middle of a surrogate pair.

Method Summary
int bounds32(int offset)
          Returns the type of the boundaries around the char at offset.
int char32At(int offset)
          Extract a single UTF-32 value from a string. If offset is in the middle of a code point, returns that code point.
int length32()
          Number of code points in a UTF16 String. Equivalent to findCodePointOffset(length())
int findCodePointOffset(int offset)
          Returns the UTF-32 offset corresponding to the first UTF-32 boundary at or after the given UTF-16 offset. Inverse of findOffsetFromCodePoint.
int findOffsetFromCodePoint(int offset32)
          Returns the UTF-16 offset that corresponds to a UTF-32 offset. Inverse of findCodePointOffset.
int getCharCount(int offset)
          Determines how many chars the code point at offset requires. If offset falls on either half of a surrogate pair, returns 2; otherwise returns 1.
java.lang.String replace(int oldChar32, int newChar32)
          Returns a new UTF16 format Unicode string resulting from replacing all occurrences of oldChar32 in source with newChar32.
java.lang.String replace(java.lang.String oldStr, java.lang.String newStr)
          Returns a new UTF16 format Unicode string resulting from replacing all occurrences of oldStr in source with newStr.
Utilities for Speed
boolean hasMoreCodePointsThan(int number)
          Check if the string contains more Unicode code points than a certain number. Equivalent to but much faster than: str.length32() > number
int moveCodePointOffset(java.lang.String source, int offset, int shift32)
          Shifts offset by the argument number of code points. Equivalent to but much faster than: str.findOffsetFromCodePoint(str.findCodePointOffset(offset) + shift32)
java.lang.String valueOf32(int offset)
          Convenience method corresponding to, but much faster than, String.value32Of(myString.char32At(offset)). Internally uses .substring(x,y), where x and y are determined by calling bounds32()
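
The offset-related methods above can be sketched as static helpers over a CharSequence. This is an illustration under assumptions: the class name CodePointOffsets is invented, the numeric values of the three boundary constants are arbitrary placeholders, and unpaired surrogates are treated as single-char code points.

```java
// Hypothetical helpers; in the proposal these are instance methods.
final class CodePointOffsets {
    // Placeholder values (assumption); the proposal does not specify them.
    static final int SINGLE_CHAR_BOUNDARY = 1;
    static final int LEAD_SURROGATE_BOUNDARY = 2;
    static final int TRAIL_SURROGATE_BOUNDARY = 3;

    private static boolean isLead(char c)  { return c >= 0xD800 && c <= 0xDBFF; }
    private static boolean isTrail(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    static int bounds32(CharSequence s, int offset) {
        char c = s.charAt(offset);
        if (isLead(c) && offset + 1 < s.length() && isTrail(s.charAt(offset + 1))) {
            return LEAD_SURROGATE_BOUNDARY;
        }
        if (isTrail(c) && offset > 0 && isLead(s.charAt(offset - 1))) {
            return TRAIL_SURROGATE_BOUNDARY;
        }
        return SINGLE_CHAR_BOUNDARY; // BMP char or unpaired surrogate
    }

    // Returns the whole code point even when offset is on the trail surrogate.
    static int char32At(CharSequence s, int offset) {
        int bounds = bounds32(s, offset);
        if (bounds == SINGLE_CHAR_BOUNDARY) return s.charAt(offset);
        int start = (bounds == TRAIL_SURROGATE_BOUNDARY) ? offset - 1 : offset;
        return ((s.charAt(start) - 0xD800) << 10)
                + (s.charAt(start + 1) - 0xDC00) + 0x10000;
    }

    // char offset -> code point offset; rounds up if offset splits a pair.
    static int findCodePointOffset(CharSequence s, int offset) {
        int count = 0;
        for (int i = 0; i < offset; i++) {
            // Count every char except the trail half of a well-formed pair.
            boolean pairedTrail = isTrail(s.charAt(i)) && i > 0 && isLead(s.charAt(i - 1));
            if (!pairedTrail) count++;
        }
        return count;
    }

    // code point offset -> char offset.
    static int findOffsetFromCodePoint(CharSequence s, int offset32) {
        int i = 0;
        while (offset32 > 0 && i < s.length()) {
            i += (bounds32(s, i) == LEAD_SURROGATE_BOUNDARY) ? 2 : 1;
            offset32--;
        }
        if (offset32 > 0) throw new IndexOutOfBoundsException("offset32 past end");
        return i;
    }

    static int length32(CharSequence s) {
        return findCodePointOffset(s, s.length());
    }

    // Answers from the char length alone whenever that is conclusive.
    static boolean hasMoreCodePointsThan(CharSequence s, int number) {
        int len = s.length();
        if ((len + 1) / 2 > number) return true; // even if every char is paired
        if (len <= number) return false;         // even if no char is paired
        return length32(s) > number;
    }
}
```

For the string "a\uD800\uDC00b", char offsets 1 and 2 both yield code point 0x10000, length32 is 3, and code point offset 2 maps back to char offset 3.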

Note: Some existing methods need to be modified to work with supplementaries, such as:

reverse()    It must keep each surrogate pair intact rather than swapping its lead and trail chars.
indexOf()    Interpret int as code point
lastIndexOf()
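
As an illustration of the reverse() requirement, the following sketch reverses by code point rather than by char. The class name SupplementaryReverse is invented for the example, and unpaired surrogates are simply copied as single chars.

```java
// Hypothetical helper showing a surrogate-aware reverse.
final class SupplementaryReverse {
    static String reverse(CharSequence s) {
        int n = s.length();
        char[] out = new char[n];
        int j = n; // next write position, filled from the end backwards
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n) {
                char d = s.charAt(i + 1);
                if (d >= 0xDC00 && d <= 0xDFFF) {
                    // Write the pair so that lead still precedes trail.
                    out[--j] = d;
                    out[--j] = c;
                    i++; // skip the trail surrogate just consumed
                    continue;
                }
            }
            out[--j] = c;
        }
        return new String(out);
    }
}
```

Reversing "a\uD800\uDC00b" this way gives "b\uD800\uDC00a"; a char-by-char reverse would instead produce the ill-formed sequence "b\uDC00\uD800a".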

StringBuffer

Method Summary
java.lang.StringBuffer append32(int char32)
          Append a single UTF-32 value to the end of a StringBuffer. Equivalent to but faster than:
append(String.value32Of(char32))
java.lang.StringBuffer delete32(int offset)
          Removes the code point at the specified position in this target (shortening target by 1 character if the code point is a non-supplementary, 2 otherwise). Equivalent to (but faster than):
int bounds = bounds32(offset);
int start = offset - (bounds == TRAIL_SURROGATE_BOUNDARY ? 1 : 0);
int end = offset + (bounds == LEAD_SURROGATE_BOUNDARY ? 2 : 1);
delete(start, end);
java.lang.StringBuffer insert32(int offset, int char32)
          Inserts char32 code point into target at (or before) the argument offset. Equivalent to but faster than:
int bounds = bounds32(offset);
int start = offset - (bounds == TRAIL_SURROGATE_BOUNDARY ? 1 : 0);
insert(start, String.value32Of(char32));
void setChar32At(int offset, int char32)
          Set a code point into a UTF16 position. Equivalent to but faster than:
int bounds = bounds32(offset);
int start = offset - (bounds == TRAIL_SURROGATE_BOUNDARY ? 1 : 0);
delete32(offset);
insert(start, String.value32Of(char32));
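
A sketch of how these operations might behave, written as static helpers over java.lang.StringBuffer because the proposed instance methods are not available. The class name StringBuffer32 and the private helpers start32 and end32 are assumptions made for the example.

```java
// Hypothetical helpers mirroring append32, delete32 and setChar32At.
final class StringBuffer32 {
    private static boolean isLead(char c)  { return c >= 0xD800 && c <= 0xDBFF; }
    private static boolean isTrail(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    static StringBuffer append32(StringBuffer target, int char32) {
        if (char32 < 0x10000) return target.append((char) char32);
        int v = char32 - 0x10000;
        return target.append((char) (0xD800 + (v >> 10)))
                     .append((char) (0xDC00 + (v & 0x3FF)));
    }

    // Start of the code point that contains the char at offset.
    private static int start32(StringBuffer t, int offset) {
        if (isTrail(t.charAt(offset)) && offset > 0 && isLead(t.charAt(offset - 1))) {
            return offset - 1;
        }
        return offset;
    }

    // End (exclusive) of the code point that starts at start.
    private static int end32(StringBuffer t, int start) {
        if (isLead(t.charAt(start)) && start + 1 < t.length()
                && isTrail(t.charAt(start + 1))) {
            return start + 2;
        }
        return start + 1;
    }

    static StringBuffer delete32(StringBuffer t, int offset) {
        int start = start32(t, offset);
        return t.delete(start, end32(t, start));
    }

    static void setChar32At(StringBuffer t, int offset, int char32) {
        int start = start32(t, offset);
        t.delete(start, end32(t, start));
        // Insert at the start of the removed code point, not at offset.
        t.insert(start, append32(new StringBuffer(2), char32).toString());
    }
}
```

Note that setChar32At inserts at the start of the code point it removed; inserting at the original offset would be off by one when offset points at a trail surrogate.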

CharacterIterator

We made a mistake in our original design of CharacterIterator: next() has pre-increment semantics, advancing the index before returning the character at the new position.

This doesn't extend well to cases where the next boundary is not simply adding an increment, since the function becomes less efficient.

There are some possibilities:

1. In our ICU4J version, UCharacterIterator, we simply produced a different iterator with better semantics.

2. Alternatively, we could modify CharacterIterator to add methods that deal with code points and address the next() issue by adding alternative methods (see below). We did that in our C++ version.

3. Another possibility is to have an abstract base class that implements CharacterIterator, but has default implementation for everything but move() and current(). (All other functions can be defined reasonably in terms of those; subclasses can override for further efficiency.)

4. A factor in this decision is how much CharacterIterator is actually used nowadays, and how much its use has been supplanted by CharSequence. It could be deprecated in favor of CharSequence; otherwise, the following additions would apply, with a few modifications:

Additional Methods

int setToStart ()
  Sets the iterator to refer to the first code unit or code point in its iteration range. This can be used to begin a forward iteration with nextPostInc() or next32PostInc().
Returns: the start position of the iteration range
int firstPostInc ()
  Sets the iterator to refer to the first code unit in its iteration range, returns that code unit, and moves the position to the second code unit. This is an alternative to setToStart() for forward iteration with nextPostInc().
Returns: the first code unit in its iteration range.
int first32PostInc ()
  Sets the iterator to refer to the first code point in its iteration range, returns that code point, and moves the position to the second code point. This is an alternative to setToStart() for forward iteration with next32PostInc().
Returns: the first code point in its iteration range.
int next32PostInc ()
  Gets the current code point for returning and advances to the next code point in the iteration range (toward endIndex()). If there are no more code points to return, returns DONE.
Returns: the current code point.
int first32 ()
  Sets the iterator to refer to the first code point in its iteration range, and returns that code point. This can be used to begin an iteration with next32(). Note that an iteration with next32PostInc(), beginning with, e.g., setToStart() or firstPostInc(), is more efficient.
Returns: the first code point in its iteration range.
int next32 ()
  Advances to the next code point in the iteration range (toward endIndex()), and returns that code point. If there are no more code points to return, returns DONE. Note that iteration with "pre-increment" semantics is less efficient than iteration with "post-increment" semantics that is provided by next32PostInc().
Returns: the next code point.
   
int setToEnd ()
  Sets the iterator to the end of its iteration range, just behind the last code unit or code point. This can be used to begin a backward iteration with previous() or previous32().
Returns: the end position of the iteration range
int last32 ()
  Sets the iterator to refer to the last code point in its iteration range, and returns that code point. This can be used to begin an iteration with previous32().
Returns: the last code point.
int previous32 ()
  Advances to the previous code point in the iteration range (toward startIndex()), and returns that code point. If there are no more code points to return, returns DONE.
Returns: the previous code point.
   
abstract int setIndex32 (int position)
  Sets the iterator to refer to the beginning of the code point that contains the "position"-th code unit in the text-storage object the iterator refers to, and returns that code point.
abstract int current32 ()
  Returns the code point the iterator currently refers to.
Utilities for Speed
abstract int move (int delta, EOrigin origin)
  Moves the current position by delta code units, relative to the start or end of the iteration range or to the current position itself.
abstract int move32 (int delta, EOrigin origin)
  Moves the current position by delta code points, relative to the start or end of the iteration range or to the current position itself.

With these additions, you can adapt old loops straightforwardly, just by using the versions with "32". Or you can use a more efficient model for forward iteration, by using the "32PostInc" versions.
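
The post-increment model can be sketched with a small standalone cursor. This is not an implementation of CharacterIterator: the class name CodePointCursor is invented, only setToStart() and next32PostInc() are shown, and DONE is assumed here to be -1 so that it cannot collide with any code point.

```java
// Hypothetical cursor illustrating forward iteration with next32PostInc().
final class CodePointCursor {
    static final int DONE = -1; // assumption: a sentinel outside 0..0x10FFFF

    private final CharSequence text;
    private int pos;

    CodePointCursor(CharSequence text) {
        this.text = text;
    }

    // Moves to the first position and returns it.
    int setToStart() {
        pos = 0;
        return pos;
    }

    // Returns the code point at the current position, then advances past it.
    int next32PostInc() {
        if (pos >= text.length()) return DONE;
        char c = text.charAt(pos++);
        if (c >= 0xD800 && c <= 0xDBFF && pos < text.length()) {
            char d = text.charAt(pos);
            if (d >= 0xDC00 && d <= 0xDFFF) {
                pos++; // consume the trail surrogate as well
                return ((c - 0xD800) << 10) + (d - 0xDC00) + 0x10000;
            }
        }
        return c; // BMP char or unpaired surrogate
    }

    // Forward loop in the post-increment style.
    static int countCodePoints(CharSequence text) {
        CodePointCursor it = new CodePointCursor(text);
        it.setToStart();
        int n = 0;
        while (it.next32PostInc() != DONE) {
            n++;
        }
        return n;
    }
}
```

Each call reads the text once and moves forward by one code point, so the loop never has to re-inspect the position it just left; that is the efficiency argument for the "32PostInc" versions.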


Open Issues:

  1. The names are only suggestions. In some cases the proposal uses "code point", in others "char32". This should be resolved one way or the other so that the naming is consistent. "Code point" is more explicit, but "32" makes a short addition to names, as in char32At. Overloading can be used, but would not be sufficient in cases where only the return value differs.
  2. It is probably better to give StringBuffer.getCharCount(int offset) and Character.getCharCount(int char32) different names. While the target functionality is the same, there could be confusion since the arguments are both ints: one is an offset and one is a code point.
  3. If char32At() cannot go on CharSequence, it might be worth having a CodePointSequence that extends CharSequence and adds a small number of methods such as char32At(). String and StringBuffer could then implement CodePointSequence.
  4. What to do about CharacterIterator (see above).
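
Open issue 3 could look roughly like the sketch below. The choice of length32() as the second method and the wrapper class StringCodePointSequence are assumptions made for illustration; the wrapper exists only because String itself cannot be changed from user code.

```java
// Hypothetical interface from open issue 3.
interface CodePointSequence extends CharSequence {
    int char32At(int offset);
    int length32();
}

// Hypothetical wrapper; in the proposal, String would implement the interface directly.
final class StringCodePointSequence implements CodePointSequence {
    private final String s;

    StringCodePointSequence(String s) {
        this.s = s;
    }

    private static boolean isLead(char c)  { return c >= 0xD800 && c <= 0xDBFF; }
    private static boolean isTrail(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    @Override public int length() { return s.length(); }
    @Override public char charAt(int index) { return s.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return new StringCodePointSequence(s.substring(start, end));
    }
    @Override public String toString() { return s; }

    // Returns the whole code point even when offset is on a trail surrogate.
    @Override public int char32At(int offset) {
        char c = s.charAt(offset);
        int start = offset;
        if (isTrail(c) && offset > 0 && isLead(s.charAt(offset - 1))) {
            start = offset - 1;
        }
        char lead = s.charAt(start);
        if (isLead(lead) && start + 1 < s.length() && isTrail(s.charAt(start + 1))) {
            return ((lead - 0xD800) << 10) + (s.charAt(start + 1) - 0xDC00) + 0x10000;
        }
        return c;
    }

    @Override public int length32() {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            // Skip the trail half of a well-formed pair so the pair counts once.
            if (isLead(s.charAt(i)) && i + 1 < s.length() && isTrail(s.charAt(i + 1))) {
                i++;
            }
            n++;
        }
        return n;
    }
}
```

Code written against CodePointSequence would then work for any implementing class while still being usable wherever a CharSequence is expected.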