ICU API Proposal

Search Mechanism for Unicode Strings

 

Introduction

 

As of ICU4C release 1.8.1, the collation module is UCA (Unicode Collation Algorithm) conformant. The collation module is able to perform language-sensitive comparison between two Unicode strings and even generate sort keys that can be easily binary compared. For instance:

 

Characters ß and SS will be compared equals on primary strength using a German collator.

 

What if a user would like to look for the non-case-sensitive match of the pattern “ß” within the Unicode string “abcSSdef”? The C API UChar * u_strstr(const UChar *s, const UChar *substring) and the C++ API UTextOffset UnicodeString::indexOf(const UnicodeString& text) would not find it, since both does a bitwise comparison. ICU4C does not have the appropriate APIs to perform a language-sensitive search.

 

The Java class com.ibm.text.StringSearch belonging to the ICU4J libraries is a search iterator that provides such a search function missing from ICU4C. com.ibm.text.StringSearch follows the rules defined in a java.text.RuleBasedCollator object (Java’s version of a collator) for matching. To optimize, it uses a version of the Boyer-Moore search algorithm adapted to work with Unicode strings. For more information, see “Efficient Text Searching in Java” published in the Java Report February 1999, and the com.ibm.text.StringSearch APIs located at http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/StringSearch.html.

 

Using ICU4J as a guide, our goal in this proposal is to provide a set of C and C++ APIs that takes a Unicode string as a pattern and looks for its match within another Unicode string, using the rules defined in a given collator. The proposed C and C++ API will have exactly the same functionalities.

 

Proposal

 

Since ICU’s C and C++ search mechanism plans to provide exactly the same functionality, for the benefit of clarity, we will treat them as one single entity and name the entity StringSearch.

 

Henceforth, any mention of pattern, refers to the Unicode text string to be matched. And any mention of text, refers to the Unicode text string in which is to be searched for pattern.

 

The StringSearch Model

Two strings match when given a collator, if ucol_strcoll == UCOL_EQUAL. String searching is a little different. A pattern matches at the offsets <start, end> in a text given a collator, if some sub-string between start and end, compares equivalent to the pattern.

So in Danish (where "aa" matches "å") "baad" will match "a--båd--man" at position <3,5>.

Below is a list of the exceptional cases and the steps that will be taken to handle them.

  1. Fuzzy Endings

Suppose a collator with ignored punctuation is being used. Then "baad" will match "a--båd--man" not only at start = 3, but also at 2 and 1. And end could be 5, 6, or 7. So a condition will be added: that if the pattern matches at <start, end> and also at <start+1, end>, we will disregard <start, end>. Similarly, if it matches at <start, end> and <start, end-1>, we will disregard <start, end>.

So in Danish (with ignored punctuation) "baad" will match "a--båd--man" ONLY at position <3,5>.

  1. Thai and Lao Character Boundaries

In collation, certain Thai and Lao vowels are swapped with the following character. For example, the text string “A\u0E02\u0E40” is modified internally in collation to “A\u0E40\u0E02”.

Looking for “A\u0E40” in “A\u0E02\u0E40” would find a ready match at the starting offset 0. The problem is in deciding which ending offset to return. Since users do not have access to the internal collation structure which handles the swapped string, the return offset for the example will be <0, 2>, even though the range would encompass the extra character ‘\u0E02’.

In short, if a pattern “P0 P1 .. Pn” is searched for in the text “T0 T1 .. Tm P0 P1 .. Pn-1 Tm+n Pn Tm+n+2  Tm+n+3…”, where ‘Tm+n ’ and ‘Pn‘ are to be swapped, a match will be found at <m+1, m+n+1>. Similarly, if a pattern “P0 P1 .. Pn” is searched for in the text “T0 T1 .. Tm-1 P0 Tm P1 P2 .. Pn Tm+n+1  Tm+n+2…”, where ‘P0‘and ‘Tm ’ is to be swapped, a match will be found at <m, m+n>.

  1. Normalization Boundaries

Normalization, like the swapping of characters described above, is done internally within collation. For the same reason that the internal collation structure is inaccessible to the user, the offsets for the result matches may contain extra combining characters near the beginning or the end of the match.

If the start of the match lies within a range of normalized characters, the start offset returned will be one character after the immediate preceding base character. If the end of the match lies within a range of normalized characters, the end offset returned will be one character before the immediate following base character.

Example: “\u0300\0325” has a match in “A\u0300\u0301\u0326\u0325B” at <1, 4>.

  1. Contraction Boundaries

Suppose that the collator has contractions, and that a contraction spans the boundary of the match. Then it will not be regarded as a match.

So in Traditional Spanish (where "ch" is a contraction), "har" will not match in "uno charo".

Boundaries that are discontiguous contractions will yield a match result similar to those described in the above normalization boundaries, whereby the end of the match returned will be one character before the immediate following base character. However, if the pattern consists of prefix accents and a match with a starting discontiguous boundary is found, the result start offset of the match could include the initial base character in the discontiguous contraction or not depending on the direction of search.

Example: Searching for the accent “\u0319” in "X\u0300\u0319\u0315", where “X\u0300\u0315” is a contraction, gives a match at <0, 4> during a forward search but <1, 3> during a backward search.

  1. Double Matches

Consider the search pattern “\u0300” and the text “A\u0300\u0300”, due to normalization, the first and only match will be found at <1, 2>. Even though there is another match within <1, 2>, the same match result will only be returned once during the entire search iteration.

  1. Pre-composed Characters

Pre-composed characters are treated as their decomposed NFD counterparts. If “\u0300” is looked for in the string “\u00C0”, a match will be found at <0, 1>.

Currently, there is no existing pre-composed character that decomposes in NFD to a character sequence with accents before a base character. StringSearch makes use of this fact for boundary checks optimization. In the event, where this is no longer true, StringSearch will be updated to handle it.

  1. User Specified Conditions

Suppose that "c" is the pattern; is there a match in "c`" (e.g. with a combining grave accent)? The above example will always match, however the behaviour can overwritten if a break iterator is provided to test the boundary. Thus if "c" in "abcd" is not to match, supply a word-break iterator. If "c" in "abc`" is not to match, provide a character-break iterator.

  1. Canonical Equivalence

Collation guarantees that any strings will be compared under canonical equivalence if normalization is on, but even if it is off, FCD text (http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#CheckFCD) is guaranteed to sort correctly. This works fine in the interior of patterns and the text sub-strings they match, since collation in ICU4C is done properly, even with discontiguous contractions.

There is a bit of a problem, though, at the boundaries. When pattern: "¸c`" (combining cedilla, c, combining grave) is searched in text: "a^¸c`˛e" (a, combining circumflex, combining underdot, c, combining grave, combining ogonek, e), a match will be found. But if the same text is normalized, the result will be "a¸^c˛`e". In that normalized version and no further processing, a match will not be found.

In order to ensure canonical equivalence, the same answer should result in either case. Hence the user is provided with two options to select for searching:

Accents refers to characters with non-zero canonical combining order and non-zero collation elements. Remember that not all non-zero canonical combining order characters are primary-ignorables, and vice versa.

A discontiguous match might occur in option 1., in which case the match result <start, end> may encompass a more accents at the ends of the match result than it is required. For instance: When the pattern "^c˛" is searched in "a^¸c`˛e" with normalization on, a match is found. But the characters that match have gaps "a^¸c`˛e".

Option 2. is more restrictive. It would permit a search for an Arabic consonant, and match against consonant + vowel. But, if a consonant + vowel1 is searched, it would not match against consonant + vowel1 + vowel2.

Hence this leads us to the below definition of a match in StringSearch.

 

 

Definition of a match

 

Let S’ be the sub-string of S at the offsets <start, end>.

A pattern string P is considered to match a text string S at the offsets start and end, <start, end> if and only if either

  1. ucol_strcoll(user specified collator, some canonical equivalent of P, some canonical equivalent of S’) == UCOL_EQUAL. or
  2. ucol_strcoll(user specified collator, P, S’) == UCOL_EQUAL and if P starts or ends with a combining mark, there exists no non-ignorable combining mark before or after S’ in S respectively.

 

The Goals for StringSearch

 

However APIs will not be provided to set and get the collator attributes. The caller would have to retrieve the collator from StringSearch and sets them using the collation APIs.


 

Proposed C header and API

 

/**

 * C Apis for an engine that provides language-sensitive text searching based

 * on the comparison rules defined in a <tt>UCollator</tt> data struct,

 * see <tt>ucol.h</tt>. This ensures that language eccentricity can be

 * handled, e.g. for the German collator, characters ß and SS will be matched

 * if case is chosen to be ignored.

 * See the <a href=http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm>

 * "ICU Collation Design Document"</a> for more information.

 * <p>

 * The algorithm implemented is a modified form of the Boyer Moore's search.

 * For more information  see

 * <a href=http://www.icu-project.org/docs/papers/text-search.html>

 * "Efficient Text Searching in Java"</a>, published in <i>Java Report</i>

 * in February, 1999, for further information on the algorithm.

 * <p>

 * There are 2 match options for selection:<br>

 * Let S' be the sub-string of a text string S between the offsets start and

 * end <start, end>.

 * <br>

 * A pattern string P matches a text string S at the offsets <start, end>

 * if

 * <pre>

 * option 1. Some canonical equivalent of P matches some canonical equivalent

 *           of S'

 * option 2. P matches S' and if P starts or ends with a combining mark,

 *           there exists no non-ignorable combining mark before or after S’

 *           in S respectively.

 * </pre>

 * Option 2. will be the default·

 * <p>

 * This search has APIs similar to that of other text iteration mechanisms

 * such as the break iterators in <tt>ubrk.h</tt>. Using these

 * APIs, it is easy to scan through text looking for all occurances of

 * a given pattern. This search iterator allows changing of direction by

 * calling a <tt>reset</tt> followed by a <tt>next</tt> or <tt>previous</tt>.

 * Though a direction change can occur without calling <tt>reset</tt> first, 

 * this operation comes with some speed penalty.

 * Generally, match results in the forward direction will match the result

 * matches in the backwards direction in the reverse order

 * <p>

 * <tt>usearch.h</tt> provides APIs to specify the starting position

 * within the text string to be searched, e.g. <tt>usearch_setOffset</tt>,

 * <tt>usearch_preceding</tt> and <tt>usearch_following</tt>. Since the

 * starting position will be set as it is specified, please take note that

 * there are some dangerous positions which the search may render incorrect

 * results:

 * <ul>

 * <li> The midst of a substring that requires normalization.

 * <li> If the following match is to be found, the position should not be the

 *      second character which requires to be swapped with the preceding

 *      character. Vice versa, if the preceding match is to be found,

 *      position to search from should not be the first character which

 *      requires to be swapped with the next character. E.g certain Thai and

 *      Lao characters require swapping.

 * <li> If a following pattern match is to be found, any position within a

 *      contracting sequence except the first will fail. Vice versa if a

 *      preceding pattern match is to be found, a invalid starting point

 *      would be any character within a contracting sequence except the last.

 * <\ul>

 * <p>

 * A breakiterator can be used if only matches at logical breaks are desired.

 * <p>

 * Options are provided to handle overlapping matches.

 * E.g. In English, overlapping matches produces the result 0 and 2

 * for the pattern "abab" in the text "ababab", where else mutually

 * exclusive matches only produce the result of 0.

 * <p>

 * Though collator attributes will be taken into consideration while

 * performing matches, there are no APIs here for setting and getting the

 * attributes. These attributes can be set by getting the collator

 * from <tt>usearch_getCollator</tt> and using the APIs in <tt>ucol.h</tt>.

 * <p>

 * Restriction: <br>

 * Currently there are no composite characters that consists of a

 * character with combining class > 0 before a character with combining

 * class == 0. However, if such a character exists in the future, the

 * search mechanism does not guarantee the results for option 1.

 *

 * <p>

 * Example of use:<br>

 * <pre><code>

 * char *tgtstr = "The quick brown fox jumped over the lazy fox";

 * char *patstr = "fox";

 * UChar target[64];

 * UChar pattern[16];

 * UErrorCode status = U_ZERO_ERROR;

 * u_uastrcpy(target, tgtstr);

 * u_uastrcpy(pattern, patstr);

 *

 * UStringSearch *search = usearch_open(pattern, -1, target, -1, "en_US",

 *                                  &status);

 * if (U_SUCCESS(status)) {

 *     for (int pos = usearch_first(search);

 *                                      pos != USEARCH_DONE;

 *                                      pos = usearch_next(search)) {

 *         printf("Found match at %d pos, length is %d\n", pos,

 *                                        usearch_getMatchLength(search));

 *     }

 * }

 * </code></pre>

 */

 

/**

* DONE is returned by previous() and next() after all valid matches have

* been returned, and by first() and last() if there are no matches at all.

*/

#define USEARCH_DONE -1

 

/**

* Data structure for searching

*/

struct UStringSearch;

typedef struct UStringSearch UStringSearch;

 

typedef enum {

    /** Option for overlapping matches */

    USEARCH_OVERLAP,

    /**

    Option for canonical matches. option 1 in header documentation.

    The default value will be USEARCH_OFF

    */

    USEARCH_CANONICAL_MATCH,

    USEARCH_ATTRIBUTE_COUNT

} USearchAttribute;

 

typedef enum {

    /** default value for any USearchAttribute */

    USEARCH_DEFAULT = -1,

    /** value for USEARCH_OVERLAP and USEARCH_CANONICAL_MATCH */

    USEARCH_OFF,

    /** value for USEARCH_OVERLAP and USEARCH_CANONICAL_MATCH */

    USEARCH_ON,

    USEARCH_ATTRIBUTE_VALUE_COUNT

} USearchAttributeValue;

 

/* open and close ------------------------------------------------------ */

 

/**

* Creating a search iterator data struct using the argument locale language

* rule set. A collator will be created in the process, which will be owned by

* this search and will be deleted in <tt>usearch_close</tt>.

* @param pattern for matching

* @param patternlength length of the pattern, -1 for null-termination

* @param text text string

* @param textlength length of the text string, -1 for null-termination

* @param locale name of locale for the rules to be used

* @param breakiter A BreakIterator that will be used to restrict the points

*                  at which matches are detected. If a match is found, but

*                  the match's start or end index is not a boundary as

*                  determined by the <tt>BreakIterator</tt>, the match will

*                  be rejected and another will be searched for.

*                  If this parameter is <tt>NULL</tt>, no break detection is

*                  attempted.

* @param status for errors if it occurs

* @return search iterator data structure

*/

U_CAPI UStringSearch * usearch_open(const UChar          *pattern,

                                          int32_t         patternlength,

                                    const UChar          *text,

                                          int32_t         textlength,

                                    const char           *locale,

                                          UBreakIterator *breakiter,

                                          UErrorCode     *status);

 

/**

* Creating a search iterator data struct using the argument collator language

* rule set. Note, user retains the ownership of this collator, thus the

* responsibility of deletion lies with the user.

* @param pattern for matching

* @param patternlength length of the pattern, -1 for null-termination

* @param text text string

* @param textlength length of the text string, -1 for null-termination

* @param collator used for the language rules

* @param breakiter A BreakIterator that will be used to restrict the points

*                  at which matches are detected. If a match is found, but

*                  the match's start or end index is not a boundary as

*                  determined by the <tt>BreakIterator</tt>, the match will

*                  be rejected and another will be searched for.

*                  If this parameter is <tt>NULL</tt>, no break detection is

*                  attempted.

* @param status for errors if it occurs

* @return search iterator data structure

*/

U_CAPI UStringSearch * usearch_openFromCollator(const UChar *pattern,

                                               int32_t         patternlength,

                                         const UChar          *text,

                                               int32_t         textlength,

                                         const UCollator      *collator,

                                               UBreakIterator *breakiter,

                                               UErrorCode     *status);

 

/**

* Destroying and cleaning up the search iterator data struct.

* If a collator is created in usearch_open, it will be destroyed here.

* @param searchiter data struct to clean up

*/

U_CAPI void usearch_close(UStringSearch *strsrch);

 

/* get and set methods -------------------------------------------------- */

 

/**

* Sets the current position in the text string which the next search will

* start from. Clears previous states.

* This method takes the argument index and sets the position in the text

* string accordingly without checking if the index is pointing to a

* valid starting point to begin searching.

* Search positions that may render incorrect results are highlighted in the

* header comments

* @param strsrch search iterator data struct

* @param position position to start next search from.

* @param status error status if any.

*/

U_CAPI void usearch_setOffset(UStringSearch *strsrch, UTextOffset position,

                              UErrorCode    *status);

 

/**

* Return the current index in the string text being searched.

* If the iteration has gone past the end of the text (or past the beginning

* for a backwards search), {@link #USEARCH_DONE} is returned.

* @param strsrch search iterator data struct

*/

U_CAPI UTextOffset usearch_getOffset(const UStringSearch *strsrch);

   

/**

* Sets the text searching attributes located in the enum USearchAttribute

* with values from the enum USearchAttributeValue.

* USEARCH_DEFAULT can be used for all attributes for resetting.

* @param strsrch search iterator data struct

* @param attribute text attribute to be set

* @param value text attribute value

* @param status for errors if it occurs

* @see #usearch_getAttribute

*/

U_CAPI void usearch_setAttribute(UStringSearch         *strsrch,

                                 USearchAttribute       attribute,

                                 USearchAttributeValue  value,

                                 UErrorCode            *status);

 

/**   

* Gets the text searching attributes.

* @param strsrch search iterator data struct

* @param attribute text attribute to be retrieve

* @return text attribute value

* @see #usearch_setAttribute

*/

U_CAPI USearchAttributeValue usearch_getAttribute(

                                         const UStringSearch    *strsrch,

                                               USearchAttribute  attribute);

 

/**

* Returns the index to the match in the text string that was searched.

* This call returns a valid result only after a successful call to

* {@link #usearch_first}, {@link #usearch_next},

* {@link #usearch_previous}, or {@link #usearch_last}.

* Just after construction, or after a searching method returns

* <tt>USEARCH_DONE</tt>, this method will return <tt>USEARCH_DONE</tt>.

* <p>

* Use usearch_getMatchedLength to get the matched string length.

* @param strsrch search iterator data struct

* @return index to a substring within the text string that is being

*         searched.

*/

U_CAPI UTextOffset usearch_getMatchedStart(const UStringSearch *strsrch);

   

/**

* Returns the length of text in the string which matches the search pattern.

* This call returns a valid result only after a successful call to

* {@link #usearch_first}, {@link #usearch_next},

* {@link #usearch_previous}, or {@link #usearch_last}.

* Just after construction, or after a searching method returns

* <tt>USEARCH_DONE</tt>, this method will return 0.

* @param strsrch search iterator data struct

* @return The length of the match in the string text, or 0 if there is no

*         match currently.

*/

U_CAPI int32_t usearch_getMatchedLength(const UStringSearch *strsrch);

 

/**

* Returns the text that was matched by the most recent call to

* {@link #usearch_first}, {@link #usearch_next},

* {@link #usearch_previous}, or {@link #usearch_last}.

* If the iterator is not pointing at a valid match (e.g. just after

* construction or after <tt>USEARCH_DONE</tt> has been returned, returns

* an empty string. If result is not large enough to store the matched text,

* result will be filled with the partial text and an U_BUFFER_OVERFLOW_ERROR

* will be returned in status. result will be null-terminated whenever

* possible. If the buffer fits the matched text exactly, a null-termination

* is not possible, then a U_STRING_NOT_TERMINATED_ERROR set in status.

* Pre-flighting can be either done with length = 0 or the API

* usearch_getMatchLength().

* @param strsrch search iterator data struct

* @param result UChar buffer to store the matched string

* @param resultCapacity length of the result buffer

* @param status error returned if result is not large enough

* @return exact length of the matched text, not counting the null-termination

*/

U_CAPI int32_t usearch_getMatchedText(const UStringSearch *strsrch,

                                            UChar         *result,

                                            int32_t        resultCapacity,

                                            UErrorCode    *status);

 

/**

* Set the BreakIterator that will be used to restrict the points at which

* matches are detected.

* @param strsrch search iterator data struct

* @param breakiter A BreakIterator that will be used to restrict the points

*                  at which matches are detected. If a match is found, but

*                  the match's start or end index is not a boundary as

*                  determined by the <tt>BreakIterator</tt>, the match will

*                  be rejected and another will be searched for.

*                  If this parameter is <tt>NULL</tt>, no break detection is

*                  attempted.

* @param status for errors if it occurs

* @see #usearch_getBreakIterator

*/

U_CAPI void usearch_setBreakIterator(UStringSearch  *strsrch,

                                     UBreakIterator *breakiter,

                                     UErrorCode     *status);

   

/**

* Returns the BreakIterator that is used to restrict the points at which

* matches are detected. This will be the same object that was passed to the

* constructor or to <tt>usearch_setBreakIterator</tt>. Note that

* <tt>NULL</tt>

* is a legal value; it means that break detection should not be attempted.

* @param strsrch search iterator data struct

* @return break iterator used

* @see #usearch_setBreakIterator

*/

U_CAPI const UBreakIterator * usearch_getBreakIterator(

                                              const UStringSearch *strsrch);

   

/**

* Set the string text to be searched. Text iteration will hence begin at the

* start of the text string. This method is useful if you want to re-use an

* iterator to search for the same pattern within a different body of text.

* @param strsrch search iterator data struct

* @param text new string to look for match

* @param textlength length of the new string, -1 for null-termination

* @param status for errors if it occurs

* @see #usearch_getText

*/

U_CAPI void usearch_setText(      UStringSearch *strsrch,

                            const UChar         *text,

                                  int32_t        textlength,

                                  UErrorCode    *status);

 

/**

* Return the string text to be searched.

* @param strsrch search iterator data struct

* @param length returned string text length

* @return string text

* @see #usearch_setText

*/

U_CAPI const UChar * usearch_getText(const UStringSearch *strsrch,

                                           int32_t       *length);

 

/**

* Gets the collator used for the language rules.

* <p>

* Deleting the returned <tt>UCollator</tt> before calling

* <tt>usearch_close</tt> would cause the string search to fail.

* <tt>usearch_close</tt> will delete the collator if this search owns it.

* @param strsrch search iterator data struct

* @return collator

*/

U_CAPI UCollator * usearch_getCollator(const UStringSearch *strsrch);

 

/**

* Sets the collator used for the language rules. User retains the ownership

* of this collator, thus the responsibility of deletion lies with the user.

* This method causes internal data such as Boyer-Moore shift tables to 

* be recalculated, but the iterator's position is unchanged.

* @param strsrch search iterator data struct

* @param collator to be used

* @param status for errors if it occurs

*/

U_CAPI void usearch_setCollator(      UStringSearch *strsrch,

                                const UCollator     *collator,

                                      UErrorCode    *status);

 

/**

* Sets the pattern used for matching.

* Internal data like the Boyer Moore table will be recalculated, but the

* iterator's position is unchanged.

* @param strsrch search iterator data struct

* @param pattern string

* @param patternlength pattern length, -1 for null-terminated string

* @param status for errors if it occurs

*/

U_CAPI void usearch_setPattern(      UStringSearch *strsrch,

                               const UChar         *pattern,

                                     int32_t        patternlength,

                                     UErrorCode    *status);

 

/**

* Gets the search pattern

* @param strsrch search iterator data struct

* @param length return length of the pattern, -1 indicates that the pattern

*               is null-terminated

* @return pattern string

*/

U_CAPI const UChar * usearch_getPattern(const UStringSearch *strsrch,

                                              int32_t       *length);

 

/* methods ------------------------------------------------------------- */

 

/**

* Returns the first index at which the string text matches the search

* pattern. 

* The iterator is adjusted so that its current index (as returned by

* {@link #usearch_getOffset}) is the match position if one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE.

* @param strsrch search iterator data struct

* @param status for errors if it occurs

* @return The character index of the first match, or

* <tt>USEARCH_DONE</tt> if there are no matches.

*/

U_CAPI UTextOffset usearch_first(UStringSearch *strsrch, UErrorCode *status);

 

/**

* Returns the first index greater than <tt>position</tt> at which the string

* text

* matches the search pattern. The iterator is adjusted so that its current

* index (as returned by {@link #usearch_getOffset}) is the match position if

* one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE

* <p>

* Search positions that may render incorrect results are highlighted in the

* header comments.

* @param strsrch search iterator data struct

* @param position to start the search at

* @param status for errors if it occurs

* @return The character index of the first match following <tt>pos</tt>,

*         or <tt>USEARCH_DONE</tt> if there are no matches.

*/

U_CAPI UTextOffset usearch_following(UStringSearch *strsrch,

                                 UTextOffset    position,

                                 UErrorCode    *status);

   

/**

* Returns the last index in the target text at which it matches the search

* pattern. The iterator is adjusted so that its current

* index (as returned by {@link #usearch_getOffset}) is the match position if

* one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE.

* @param strsrch search iterator data struct

* @param status for errors if it occurs

* @return The index of the first match, or <tt>USEARCH_DONE</tt> if there

*         are no matches.

*/

U_CAPI UTextOffset usearch_last(UStringSearch *strsrch, UErrorCode *status);

 

/**

* Returns the first index less than <tt>position</tt> at which the string text

* matches the search pattern. The iterator is adjusted so that its current

* index (as returned by {@link #usearch_getOffset}) is the match position if

* one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE

* <p>

* Search positions that may render incorrect results are highlighted in the

* header comments.

* @param strsrch search iterator data struct

* @param position index position the search is to begin at

* @param status for errors if it occurs

* @return The character index of the first match preceding <tt>pos</tt>,

*         or <tt>USEARCH_DONE</tt> if there are no matches.

*/

U_CAPI UTextOffset usearch_preceding(UStringSearch *strsrch,

                                 UTextOffset    position,

                                 UErrorCode    *status);

   

/**

* Returns the index of the next point at which the string text matches the

* search pattern, starting from the current position.

* The iterator is adjusted so that its current

* index (as returned by {@link #usearch_getOffset}) is the match position if

* one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE

* @param strsrch search iterator data struct

* @param status for errors if it occurs

* @return The index of the next match after the current position, or

*         <tt>USEARCH_DONE</tt> if there are no more matches.

* @see #usearch_first

*/

U_CAPI UTextOffset usearch_next(UStringSearch *strsrch, UErrorCode *status);

 

/**

* Returns the index of the previous point at which the string text matches

* the search pattern, starting at the current position.

* The iterator is adjusted so that its current

* index (as returned by {@link #usearch_getOffset}) is the match position if

* one was found.

* If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

* the iterator will be adjusted to the index USEARCH_DONE

* @param strsrch search iterator data struct

* @param status for errors if it occurs

* @return The index of the previous match before the current position,

*         or <tt>USEARCH_DONE</tt> if there are no more matches.

*/

U_CAPI UTextOffset usearch_previous(UStringSearch *strsrch, UErrorCode *status);

   

/**

* Reset the iteration.

* Search will begin at the start of the text string if a forward iteration

* is initiated before a backwards iteration. Otherwise if a backwards

* iteration is initiated before a forwards iteration, the search will begin

* at the end of the text string.

* @param strsrch search iterator data struct

*/

U_CAPI void usearch_reset(UStringSearch *strsrch);


 

Proposed C++ class design and API

 

There are 2 classes are designed to support StringSearch:

SearchIterator, an abstract base class which all search classes are to extend from, and StringSearch, the concrete subclass of SearchIterator that provides language-sensitive text searching based on the comparison rules defined RuleBasedCollator.

 

#include "unicode/unistr.h"

#include "unicode/chariter.h"

#include "unicode/brkiter.h"

#include "unicode/usearch.h"

 

/**

 * <tt>SearchIterator</tt> is an abstract base class that provides

 * methods to search for a pattern within a text string. Instances of

 * <tt>SearchIterator</tt> maintain a current position and scans over the

 * target text, returning the indices the pattern is matched and the length

 * of each match.

 * <p>

 * <tt>SearchIterator</tt> defines a protocol for text searching.

 * Subclasses provide concrete implementations of various search algorithms.

 * For example, {@link StringSearch} implements language-sensitive pattern

 * matching based on the comparison rules defined in a

 * {@link RuleBasedCollator} object.

 * <p>

 * Other options for searching includes using a BreakIterator to restrict

 * the points at which matches are detected.

 * <p>

 * <tt>SearchIterator</tt> provides an API that is similar to that of

 * other text iteration classes such as <tt>BreakIterator</tt>. Using

 * this class, it is easy to scan through text looking for all occurances of

 * a given pattern. The following example uses a <tt>StringSearch</tt>

 * object to find all instances of "fox" in the target string. Any other

 * subclass of <tt>SearchIterator</tt> can be used in an identical

 * manner.

 * <pre><code>

 * UnicodeString target("The quick brown fox jumped over the lazy fox");

 * UnicodeString pattern("fox");

 *

 * SearchIterator *iter = new StringSearch(pattern, target);

 *

 * for (int pos = iter->first(); pos != USEARCH_DONE;

 *                               pos = iter->next()) {

 *     printf("Found match at %d pos, length is %d\n", pos,

 *                                             iter.getMatchLength());

 * }

 * </code></pre>

 *

 * @see StringSearch

 */

 

struct USearch;

typedef struct USearch USearch;

 

/**

* Data structure for searching

*/

class U_I18N_API SearchIterator {

 

public:

 

    // public constructors and destructors -------------------------------

 

    /**

    * Copy constructor that creates a SearchIterator instance with the same

    * behavior, and iterating over the same text.

    * @param other the SearchIterator instance to be copied.

    */

    SearchIterator(const SearchIterator &other);

 

    /**

     * Destructor. Cleans up the search iterator data struct.

     */

    virtual ~SearchIterator();

 

    // public get and set methods ----------------------------------------

 

    /**

     * Sets the index to point to the given position, and clears any state

     * that's affected.

     * <p>

     * This method takes the argument index and sets the position in the text

     * string accordingly without checking if the index is pointing to a

     * valid starting point to begin searching.

     * @param position within the text to be set

     * @param status for errors if it occurs

     */

    virtual void setOffset(UTextOffset position, UErrorCode &status) = 0;

 

    /**

     * Return the current index in the text being searched.

     * If the iteration has gone past the end of the text

     * (or past the beginning for a backwards search), {@link #USEARCH_DONE}

     * is returned.

     * @return current index in the text being searched.

     */

    virtual UTextOffset getOffset(void) const = 0;

 

    /**

    * Sets the text searching attributes located in the enum

    * USearchAttribute with values from the enum USearchAttributeValue.

    * USEARCH_DEFAULT can be used for all attributes for resetting.

    * @param attribute text attribute (enum USearchAttribute) to be set

    * @param value text attribute value

    * @param status for errors if it occurs

    */

    void setAttribute(USearchAttribute       attribute,

                      USearchAttributeValue  value,

                      UErrorCode            &status);

 

    /**   

    * Gets the text searching attributes

    * @param attribute text attribute (enum USearchAttribute) to be retrieve

    * @return text attribute value

    */

    USearchAttributeValue getAttribute(USearchAttribute  attribute) const;

   

    /**

    * Returns the index to the match in the text string that was searched.

    * This call returns a valid result only after a successful call to

    * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.

    * Just after construction, or after a searching method returns

    * <tt>USEARCH_DONE</tt>, this method will return <tt>USEARCH_DONE</tt>.

    * <p>

    * Use getMatchedLength to get the matched string length.

    * @return index of a substring within the text string that is being

    *         searched.

    */

    UTextOffset getMatchedStart(void) const;

 

    /**

     * Returns the length of text in the string which matches the search

     * pattern. This call returns a valid result only after a successful call

     * to {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.

     * Just after construction, or after a searching method returns

     * <tt>USEARCH_DONE</tt>, this method will return 0.

     * @return The length of the match in the target text, or 0 if there

     *         is no match currently.

     */

    int32_t getMatchedLength(void) const;

   

    /**

     * Returns the text that was matched by the most recent call to

     * {@link #first}, {@link #next}, {@link #previous}, or {@link #last}.

     * If the iterator is not pointing at a valid match (e.g. just after

     * construction or after <tt>USEARCH_DONE</tt> has been returned,

     * returns an empty string.

     * @param result stores the matched string or an empty string if a match

     *        is not found.

     */

    void getMatchedText(UnicodeString &result) const;

   

    /**

     * Set the BreakIterator that will be used to restrict the points

     * at which matches are detected. The user is responsible for deleting

     * the breakiterator.

     * @param breakiter A BreakIterator that will be used to restrict the

     *                points at which matches are detected. If a match is

     *                found, but the match's start or end index is not a

     *                boundary as determined by the <tt>BreakIterator</tt>,

     *                the match will be rejected and another will be searched

     *                for. If this parameter is <tt>NULL</tt>, no break

     *                detection is attempted.

     * @param  status for errors if it occurs

     */

    void setBreakIterator(BreakIterator *breakiter, UErrorCode &status);

   

    /**

     * Returns the BreakIterator that is used to restrict the points at

     * which matches are detected.  This will be the same object that was

     * passed to the constructor or to <tt>setBreakIterator</tt>.

     * Note that <tt>NULL</tt> is a legal value; it means that break

     * detection should not be attempted.

     * @return BreakIterator used to restrict matchings.

     */

    const BreakIterator * getBreakIterator(void) const;

 

    /**

     * Set the string text to be searched. Text iteration will hence begin at

     * the start of the text string. This method is useful if you want to

     * re-use an iterator to search for the same pattern within a different

     * body of text. The user is responsible for deleting the text.

     * @param text string to be searched.

     * @param status for errors if it occurs

     */

    virtual void setText(const UnicodeString &text, UErrorCode &status);   

 

    /**

     * Set the string text to be searched. Text iteration will hence begin at

     * the start of the text string. This method is useful if you want to

     * re-use an iterator to search for the same pattern within a different

     * body of text.

     * <p>

     * Note: No parsing of the text within the <tt>CharacterIterator</tt>

     * will be done during searching for this version. The block of text

     * in <tt>CharacterIterator</tt> will be used as it is.

     * The user is responsible for deleting the text.

     * @param text string iterator to be searched.

     * @param  status for errors if it occurs

     */

    virtual void setText(CharacterIterator &text, UErrorCode &status);

   

    /**

     * Return the string text to be searched.

     * @return text string to be searched.

     */

    const UnicodeString & getText(void) const;

 

    // operator overloading ----------------------------------------------

 

    /**

     * Equality operator.

     * @param that SearchIterator instance to be compared.

     * @return TRUE if both BreakIterators are of the same class, have the

     *         same behavior, terates over the same text and have the same

     *         attributes. FALSE otherwise.

     */

    virtual UBool operator==(const SearchIterator &that) const;

 

    /**

     * Not-equal operator.

     * @param that SearchIterator instance to be compared.

     * @return FALSE if operator== returns TRUE, and vice versa.

     */

    UBool operator!=(const SearchIterator &that) const;

 

    // public methods ----------------------------------------------------

 

    /**

     * Returns a copy of SearchIterator with the same behavior, and

     * iterating over the same text, as this one. Note that all data will be

     * replicated, except for the text string to be searched.

     * @return cloned object

     */

    virtual SearchIterator* safeClone(void) const = 0;

 

    /**

     * Returns the first index at which the string text matches the search

     * pattern. The iterator is adjusted so that its current index (as

     * returned by {@link #usearch_getOffset}) is the match position if one

     * was found.

     * If a match is not found, <tt>USEARCH_DONE</tt> will be returned and

     * the iterator will be adjusted to the index USEARCH_DONE

     * @param  status for errors if it occurs

     * @return The character index of the first match, or

     *         <tt>USEARCH_DONE</tt> if there are no matches.

     */

    UTextOffset first(UErrorCode &status);

 

    /**

     * Returns the first index greater than <tt>position</tt> at which the

     * string text matches the search pattern. The iterator is adjusted so

     * that its current index (as returned by {@link #getOffset}) is the

     * match position if one was found. If a match is not found,

     * <tt>USEARCH_DONE</tt> will be returned and the iterator will be

     * adjusted to the index USEARCH_DONE