BOCU-1

A MIME-compatible application of the Binary Ordered Compression for Unicode base algorithm.

Markus Scherer & Mark Davis
San José 2002-jan-29

Summary

This document describes a Unicode encoding for the storage and exchange of text data. It is stateful and provides a good byte/code point ratio while being directly usable in SMTP emails, database fields and other contexts.


Note that Unicode Technical Note #6 supersedes this design document. It is complete with sample code and a Windows executable but does not contain the detailed analysis at the bottom of this document.
Markus 2002-dec-17


Introduction

The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.

In some markets where scripts other than Latin are used, the high bytes/code point ratio of UTF-8 (and of UTF-16 for scripts other than Latin and CJK ideographs) has been criticized and used as a motivation to not use Unicode for the exchange of documents in some languages. Larger document sizes are also a problem in low-bandwidth environments such as wireless networks for handheld systems.

SCSU was created as a Unicode compression scheme with a byte/code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. SCSU is not suitable for MIME "text" media types, i.e., it cannot be used directly in emails and similar protocols. SCSU requires a complicated encoder design for good performance.

BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. It is useful for short strings and maintains code point order.

Basic features of BOCU-1

BOCU-1 is a stateful, byte-oriented encoding of Unicode. Like SCSU it can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a "charset" according to IANA, and it is suitable for MIME "text" media types.

BOCU-1 is more complicated than UTF-8 but much simpler than SCSU. Its inter-character state consists of a single integer. It is deterministic, i.e., the same complete input text is always encoded the same way by all encoders.

The byte/code point ratio is 1 for runs of code points from the same block of 0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with SCSU. This is much better than UTF-8 for Indic scripts, Russian, Greek, Arabic, Hebrew, etc. The startup overhead is very low (similar to SCSU), which makes it useful for very short strings like individual names. The maximum number of bytes per code point is four.

The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike SCSU — which allows the compression of large, sorted lists of strings.

The C0 control codes NUL, CR and LF and nine others are encoded with the same byte values as in US-ASCII, and those byte values are used only for the encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME "text" media types, directly usable in emails and generally "friendly" for ASCII-based tools. The SUB control and its byte value 1A are included in this set to avoid problems in DOS/Windows/OS/2 systems where 1A is interpreted as an end-of-file marker in text mode.

The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other's encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike SCSU). This also allows some limited random access.

Byte values for single-byte codes and lead bytes overlap with trail bytes. So unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point.
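Backing up to a reset point can be sketched as follows. This is a minimal illustration, not part of the sample code: the function names are hypothetical, and the set of twelve control byte values that never occur as trail bytes (00, 07..0F, 1A, 1B) is an assumption taken from the published BOCU-1 specification rather than from the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if b is one of the twelve C0 control bytes that BOCU-1 never
 * uses as a trail byte (assumed set: 00, 07..0F, 1A, 1B). */
int bocu1_is_reset_byte(uint8_t b) {
    return b == 0x00 || (0x07 <= b && b <= 0x0f) || b == 0x1a || b == 0x1b;
}

/* Scan backwards from pos and return the index just past the nearest
 * reset byte, or 0 if there is none (the start of the text is also in
 * the initial state). Decoding can safely resume at the returned index
 * with the state set to its initial value. */
size_t bocu1_find_resync(const uint8_t *s, size_t pos) {
    while (pos > 0) {
        if (bocu1_is_reset_byte(s[pos - 1])) {
            return pos;
        }
        --pos;
    }
    return 0;
}
```

An FF byte is deliberately not treated as a resynchronization point here, because FF can also be a trail byte.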

Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes.

US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.

As a single/lead byte, byte value FF is used as a special "reset-state-only" command. It does not encode any code point. FF is also a normal trail byte.
Having a "reset only" command allows simple concatenation of BOCU-1 byte streams. (All other BOCU-1 byte sequences would append some code point.) However, using FF to reset the state breaks the ordering, so the use of FF resets is discouraged.
Concatenating byte streams without an FF reset requires scanning back in the first stream to a C0 control whose byte value is not used for trail bytes (the last known reset to the initial state), decoding from there to the end of the first stream to obtain the current state, encoding the first non-U+0020 code point of the second stream according to that state, and then appending the rest of the second stream. The same procedure can be used to remove an FF reset command.
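Where the loss of binary ordering is acceptable, the simple concatenation with an FF reset might look like the sketch below. The function name and signature are hypothetical. It assumes that the first stream ends on a complete character (otherwise the FF would be read as a trail byte) and that the second stream was encoded starting from the initial state.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a, then the reset-only byte FF, then b into dst.
 * Returns the number of bytes written, or 0 if dst is too small. */
size_t bocu1_concat_with_reset(uint8_t *dst, size_t cap,
                               const uint8_t *a, size_t a_len,
                               const uint8_t *b, size_t b_len) {
    size_t total = a_len + 1 + b_len;
    if (total > cap) {
        return 0;
    }
    memcpy(dst, a, a_len);
    dst[a_len] = 0xFF;                 /* reset state, no code point */
    memcpy(dst + a_len + 1, b, b_len);
    return total;
}
```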

Signature byte sequence

An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
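A check for this signature could be written as below. The function name is hypothetical. Note that BOCU-1 is stateful: the bytes after the signature are encoded relative to the state left behind by U+FEFF, so a reader should decode the signature as an ordinary code point and then discard it, rather than cutting off the three bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if the buffer starts with the BOCU-1 encoding of U+FEFF. */
int bocu1_has_signature(const uint8_t *s, size_t len) {
    return len >= 3 && s[0] == 0xFB && s[1] == 0xEE && s[2] == 0x28;
}
```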

Encoding algorithm

The basic algorithm is as described in Binary Ordered Compression for Unicode.

BOCU-1 differs from the generic algorithm by using a different set of byte value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.

Partial pseudo-code for a per-code point encoding function is as follows:

encode(int &prev, int c) {
    if(c<=0x20) {
        output (byte)c;
        if(c!=0x20) {
            prev=0x40;
        }
    } else {
        int diff=c-prev;
        // encode diff in 1..4 bytes and output them

        // adjust prev
        if(c is Hiragana) {
            prev=middle of Hiragana;
        } else if(c is CJK Unihan) {
            prev=middle of CJK Unihan;
        } else if(c is Hangul) {
            prev=middle of Hangul;
        } else {
            prev=(c&~0x7f)+0x40;
        }
    }
}
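For illustration, the pseudo-code above can be filled in as the following C sketch. It is not the sample code itself. The lead byte ranges, the 243-value trail byte set and the script ranges are assumptions taken from the published BOCU-1 constants, and all identifiers are hypothetical. Input validation (negative values, values above U+10FFFF) is omitted.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    B1_ASCII_PREV = 0x40,            /* initial and reset state */
    B1_MIDDLE = 0x90,                /* single byte for diff == 0 */
    B1_TRAIL_CONTROLS = 20,          /* C0 bytes usable as trail bytes */
    B1_TRAIL_COUNT = 243,
    B1_TRAIL_OFFSET = 13,
    B1_REACH_POS_1 = 63,     B1_REACH_NEG_1 = -64,
    B1_REACH_POS_2 = 10512,  B1_REACH_NEG_2 = -10513,
    B1_REACH_POS_3 = 187659, B1_REACH_NEG_3 = -187660,
    B1_START_POS_2 = 0xd0, B1_START_POS_3 = 0xfb, B1_START_POS_4 = 0xfe,
    B1_START_NEG_2 = 0x50, B1_START_NEG_3 = 0x25, B1_START_NEG_4 = 0x22
};

/* Map a trail "digit" 0..242 to a byte, skipping the twelve reserved controls. */
static uint8_t b1_trail_to_byte(int32_t t) {
    static const uint8_t controls[B1_TRAIL_CONTROLS] = {
        0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
        0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1c, 0x1d, 0x1e, 0x1f
    };
    return t >= B1_TRAIL_CONTROLS ? (uint8_t)(t + B1_TRAIL_OFFSET) : controls[t];
}

/* New state after encoding c: the middle of c's script or 128-block. */
static int32_t b1_next_prev(int32_t c) {
    if (0x3040 <= c && c <= 0x309f) return 0x3070;                 /* Hiragana */
    if (0x4e00 <= c && c <= 0x9fa5) return 0x4e00 - B1_REACH_NEG_2; /* CJK Unihan */
    if (0xac00 <= c && c <= 0xd7a3) return (0xd7a3 + 0xac00) / 2;   /* Hangul */
    return (c & ~0x7f) + B1_ASCII_PREV;
}

/* Encode one code point into out[0..3]; returns the byte count (1..4). */
size_t bocu1_encode_cp(int32_t *prev, int32_t c, uint8_t out[4]) {
    int32_t diff, lead, m;
    int count, i;                    /* count = number of trail bytes */

    if (c <= 0x20) {
        if (c != 0x20) *prev = B1_ASCII_PREV;
        out[0] = (uint8_t)c;
        return 1;
    }
    diff = c - *prev;
    *prev = b1_next_prev(c);
    if (B1_REACH_NEG_1 <= diff && diff <= B1_REACH_POS_1) {
        out[0] = (uint8_t)(B1_MIDDLE + diff);
        return 1;
    }
    if (diff > 0) {
        if (diff <= B1_REACH_POS_2)      { diff -= B1_REACH_POS_1 + 1; lead = B1_START_POS_2; count = 1; }
        else if (diff <= B1_REACH_POS_3) { diff -= B1_REACH_POS_2 + 1; lead = B1_START_POS_3; count = 2; }
        else                             { diff -= B1_REACH_POS_3 + 1; lead = B1_START_POS_4; count = 3; }
    } else {
        if (diff >= B1_REACH_NEG_2)      { diff -= B1_REACH_NEG_1; lead = B1_START_NEG_2; count = 1; }
        else if (diff >= B1_REACH_NEG_3) { diff -= B1_REACH_NEG_2; lead = B1_START_NEG_3; count = 2; }
        else                             { diff -= B1_REACH_NEG_3; lead = B1_START_NEG_4; count = 3; }
    }
    /* Trail bytes are base-243 digits, last digit first; floor division
     * keeps each digit non-negative for negative differences. */
    for (i = count; i > 0; --i) {
        m = diff % B1_TRAIL_COUNT;
        diff /= B1_TRAIL_COUNT;
        if (m < 0) { --diff; m += B1_TRAIL_COUNT; }
        out[i] = b1_trail_to_byte(m);
    }
    out[0] = (uint8_t)(lead + diff);
    return (size_t)count + 1;
}
```

Starting from the initial state 0x40, this sketch encodes U+FEFF as FB EE 28, which agrees with the signature byte sequence given earlier. The sample C code remains the authoritative specification.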

Sample C code

The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences. Recovery from illegal input is not specified. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in UTF-16). Well-formed supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded as single code points, not as two separate surrogates.

This code uses ICU standard headers and the one implementation file icu/source/common/utf_impl.c. (It is not necessary to link the entire ICU common library.) This is for convenience in the surrounding test functions and not necessary for the core BOCU-1 functions. These headers and implementation file provide the following:

This code is under the X license (ICU version).

Files:

A complete, compiled sample executable for Windows from this source code is available for download. See Unicode Technical Note #6 and the note at the top of this document. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of "?" or "-h" for usage.

Changes

2002-apr-12: Use single/lead byte value FF as "reset only" without code point output. FF is still also used as a trail byte.

2002-mar-20: Remove resetting at NL, LS, PS (see change from feb21). Reason: Resetting for them makes the algorithm more complicated without much gain. These characters are rarely used and their BOCU encoding still depends on the preceding code point, unlike with LF etc. The only gain would be when NL/LS/PS were used with US-ASCII text, where the first character on a line would need one byte instead of two.

2002-feb-21: Change adjustment of prev to account for Hiragana (not 128-aligned) and to reset at U+0085 (Newline), U+2028 (Line Separator) and U+2029 (Paragraph Separator).


BOCU-1 performance

Markus Scherer
2002-apr-03

This document provides a performance comparison between BOCU-1, UTF-8, and SCSU when implemented as ICU converters, which convert to and from UTF-16.

Summary

Values are relative to UTF-8.

                              BOCU-1                            SCSU
Languages                     Size of text   Time to convert    Size of text   Time to convert
                                             to/from UTF-16                    to/from UTF-16
English/French                100%           160..170%          100%           125%
Greek/Russian/Arabic/Hebrew   60%            65..70%            55%            70%
Hindi                         40%            45%                40%            45%
Thai (see below)              45%            60%                40%            55%
Japanese                      60%            150%               55%            110%
Korean                        75%            155%               85%            70%
Chinese                       75%            165%               70%            65%

(Smaller percentages are better. Percentages are rounded to the nearest 5%.)

The size reduction is smaller for web pages, which contain a lot of ASCII in the HTML markup. The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), the transmission time may swamp the conversion time, and smaller text will then transmit faster.

SCSU vs. BOCU-1

Test setup

The texts are the "What is Unicode" pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar.

The Thai text, th18057.txt, has a different structure: It is a Thai word list from ICU's test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.

This comparison uses full-fledged ICU converters for UTF-8, SCSU and BOCU-1. "Full-fledged ICU converter" means that this is with the ICU conversion API, designed for external encodings, as used e.g. by an XML parser or web browser.
There are also ICU functions for transformations of in-process strings between UTF-8/16/32 that have a little less overhead (because they do not handle split-buffer conversions and customizable error handling). Something like that could be written for BOCU-1; I guess that the relative performance based on this would be similar to these numbers.

The test outputs the number of encoding bytes and the time to roundtrip-convert from the internal UTF-16 form to the encoding and back. The test machine is an IBM Thinkpad 570 (Pentium 2/366) with 192MB of RAM.

The ICU converter code for SCSU and BOCU-1 that is tested here is not currently part of the ICU CVS. The SCSU converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. Code very similar to this may become part of ICU in 2002.


Full results

The output shows results for multiple files, for UTF-8, SCSU and BOCU-1 for each file. For each file and encoding there are two sets of numbers, one measured with a large intermediate buffer and one measured with a small intermediate buffer. In each case, roundtrip conversions are performed for 2 seconds and the total time is divided by the number of roundtrips.
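The measurement method described above (repeat the roundtrip for a fixed duration, then divide the total time by the number of roundtrips) can be sketched roughly as follows. This is not the actual test harness: the names are hypothetical, the callback stands in for one roundtrip conversion, and the standard C clock() function measures processor time, which only approximates elapsed time for a CPU-bound loop.

```c
#include <time.h>

/* One unit of work to be timed, e.g. one roundtrip conversion. */
typedef void (*roundtrip_fn)(void *context);

/* Trivial example workload: counts how often it has been called. */
void count_call(void *context) {
    ++*(unsigned long *)context;
}

/* Calls fn repeatedly for at least min_seconds and returns the
 * average time per call in seconds. */
double average_roundtrip_seconds(roundtrip_fn fn, void *context,
                                 double min_seconds) {
    clock_t start = clock();
    unsigned long count = 0;
    double elapsed;

    do {
        fn(context);
        ++count;
        elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed < min_seconds);
    return elapsed / (double)count;
}
```

Called with min_seconds set to 2.0 and a callback that performs one roundtrip conversion, this would correspond to the procedure described above.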

# testing converter performance with file "arabic.html"
no Unicode signature - using UTF-8
number of code points                       9402 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 11192 B
  roundtrip conversion time              709.154 μs
  average bytes/code point               1.19039 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 11192 B
  roundtrip conversion time              1116.58 μs
  average bytes/code point               1.19039 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  9534 B  ( 85% of UTF-8)
  roundtrip conversion time              778.212 μs (109% of UTF-8)
  average bytes/code point               1.01404 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  9534 B
  roundtrip conversion time              1164.65 μs
  average bytes/code point               1.01404 B/cp

# testing converter performance with file "arabic.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       2270 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  4035 B
  roundtrip conversion time              337.073 μs
  average bytes/code point               1.77753 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  4035 B
  roundtrip conversion time              508.123 μs
  average bytes/code point               1.77753 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2375 B  ( 58% of UTF-8)
  roundtrip conversion time              220.328 μs ( 65% of UTF-8)
  average bytes/code point               1.04626 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2375 B
  roundtrip conversion time              323.184 μs
  average bytes/code point               1.04626 B/cp

# testing converter performance with file "english.html"
no Unicode signature - using UTF-8
number of code points                      14502 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 14695 B
  roundtrip conversion time              755.631 μs
  average bytes/code point               1.01331 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 14695 B
  roundtrip conversion time               1259.7 μs
  average bytes/code point               1.01331 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 14590 B  ( 99% of UTF-8)
  roundtrip conversion time              1150.54 μs (152% of UTF-8)
  average bytes/code point               1.00607 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                 14590 B
  roundtrip conversion time              1713.43 μs
  average bytes/code point               1.00607 B/cp

# testing converter performance with file "english.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       2975 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2975 B
  roundtrip conversion time              148.619 μs
  average bytes/code point                     1 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2975 B
  roundtrip conversion time              251.223 μs
  average bytes/code point                     1 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2975 B  (100% of UTF-8)
  roundtrip conversion time              242.559 μs (163% of UTF-8)
  average bytes/code point                     1 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2975 B
  roundtrip conversion time              359.415 μs
  average bytes/code point                     1 B/cp

# testing converter performance with file "french.html"
no Unicode signature - using UTF-8
number of code points                      10432 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10506 B
  roundtrip conversion time              536.455 μs
  average bytes/code point               1.00709 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10506 B
  roundtrip conversion time              897.748 μs
  average bytes/code point               1.00709 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10578 B  (100% of UTF-8)
  roundtrip conversion time              846.154 μs (157% of UTF-8)
  average bytes/code point                 1.014 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10578 B
  roundtrip conversion time              1263.93 μs
  average bytes/code point                 1.014 B/cp

# testing converter performance with file "french.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       3415 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  3488 B
  roundtrip conversion time              188.281 μs
  average bytes/code point               1.02138 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  3488 B
  roundtrip conversion time              305.797 μs
  average bytes/code point               1.02138 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  3559 B  (102% of UTF-8)
  roundtrip conversion time              319.852 μs (169% of UTF-8)
  average bytes/code point               1.04217 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  3559 B
  roundtrip conversion time               470.63 μs
  average bytes/code point               1.04217 B/cp

# testing converter performance with file "greek.html"
no Unicode signature - using UTF-8
number of code points                      10544 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 13292 B
  roundtrip conversion time              878.842 μs
  average bytes/code point               1.26062 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 13292 B
  roundtrip conversion time              1386.92 μs
  average bytes/code point               1.26062 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10760 B  ( 80% of UTF-8)
  roundtrip conversion time              901.917 μs (102% of UTF-8)
  average bytes/code point               1.02049 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10760 B
  roundtrip conversion time              1340.21 μs
  average bytes/code point               1.02049 B/cp

# testing converter performance with file "greek.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       3574 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  6313 B
  roundtrip conversion time              516.541 μs
  average bytes/code point               1.76637 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  6313 B
  roundtrip conversion time              782.116 μs
  average bytes/code point               1.76637 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  3736 B  ( 59% of UTF-8)
  roundtrip conversion time              343.872 μs ( 66% of UTF-8)
  average bytes/code point               1.04533 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  3736 B
  roundtrip conversion time              504.132 μs
  average bytes/code point               1.04533 B/cp

# testing converter performance with file "hebrew.html"
no Unicode signature - using UTF-8
number of code points                       9168 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10953 B
  roundtrip conversion time              702.245 μs
  average bytes/code point                1.1947 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10953 B
  roundtrip conversion time              1102.28 μs
  average bytes/code point                1.1947 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  9375 B  ( 85% of UTF-8)
  roundtrip conversion time              784.104 μs (111% of UTF-8)
  average bytes/code point               1.02258 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  9375 B
  roundtrip conversion time              1167.55 μs
  average bytes/code point               1.02258 B/cp

# testing converter performance with file "hebrew.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       2378 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  4150 B
  roundtrip conversion time              354.799 μs
  average bytes/code point               1.74516 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  4150 B
  roundtrip conversion time              525.437 μs
  average bytes/code point               1.74516 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2544 B  ( 61% of UTF-8)
  roundtrip conversion time              244.949 μs ( 69% of UTF-8)
  average bytes/code point               1.06981 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2544 B
  roundtrip conversion time              359.682 μs
  average bytes/code point               1.06981 B/cp

# testing converter performance with file "hindi.html"
no Unicode signature - using UTF-8
number of code points                       9813 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 14549 B
  roundtrip conversion time              1046.09 μs
  average bytes/code point               1.48263 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 14549 B
  roundtrip conversion time              1656.64 μs
  average bytes/code point               1.48263 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10033 B  ( 68% of UTF-8)
  roundtrip conversion time              851.978 μs ( 81% of UTF-8)
  average bytes/code point               1.02242 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10033 B
  roundtrip conversion time              1263.72 μs
  average bytes/code point               1.02242 B/cp

# testing converter performance with file "hindi.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       2993 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  7703 B
  roundtrip conversion time              689.618 μs
  average bytes/code point               2.57367 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  7703 B
  roundtrip conversion time              1067.52 μs
  average bytes/code point               2.57367 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  3149 B  ( 40% of UTF-8)
  roundtrip conversion time              298.935 μs ( 43% of UTF-8)
  average bytes/code point               1.05212 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  3149 B
  roundtrip conversion time              435.626 μs
  average bytes/code point               1.05212 B/cp

# testing converter performance with file "japanese.html"
no Unicode signature - using UTF-8
number of code points                       7890 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10726 B
  roundtrip conversion time              702.417 μs
  average bytes/code point               1.35944 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10726 B
  roundtrip conversion time              1112.17 μs
  average bytes/code point               1.35944 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  9015 B  ( 84% of UTF-8)
  roundtrip conversion time              1077.62 μs (153% of UTF-8)
  average bytes/code point               1.14259 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  9015 B
  roundtrip conversion time              1464.18 μs
  average bytes/code point               1.14259 B/cp

# testing converter performance with file "japanese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       1584 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  4400 B
  roundtrip conversion time               380.82 μs
  average bytes/code point               2.77778 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  4400 B
  roundtrip conversion time              580.748 μs
  average bytes/code point               2.77778 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2648 B  ( 60% of UTF-8)
  roundtrip conversion time              567.302 μs (148% of UTF-8)
  average bytes/code point               1.67172 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2648 B
  roundtrip conversion time              691.883 μs
  average bytes/code point               1.67172 B/cp

# testing converter performance with file "korean.html"
no Unicode signature - using UTF-8
number of code points                       8287 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 10401 B
  roundtrip conversion time              681.758 μs
  average bytes/code point                1.2551 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 10401 B
  roundtrip conversion time              1076.97 μs
  average bytes/code point                1.2551 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  9560 B  ( 91% of UTF-8)
  roundtrip conversion time              1062.04 μs (155% of UTF-8)
  average bytes/code point               1.15361 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  9560 B
  roundtrip conversion time              1494.03 μs
  average bytes/code point               1.15361 B/cp

# testing converter performance with file "korean.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       1587 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  3659 B
  roundtrip conversion time              328.427 μs
  average bytes/code point               2.30561 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  3659 B
  roundtrip conversion time               500.49 μs
  average bytes/code point               2.30561 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2753 B  ( 75% of UTF-8)
  roundtrip conversion time              506.637 μs (154% of UTF-8)
  average bytes/code point               1.73472 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2753 B
  roundtrip conversion time               655.06 μs
  average bytes/code point               1.73472 B/cp

# testing converter performance with file "russian.html"
no Unicode signature - using UTF-8
number of code points                      10869 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 13870 B
  roundtrip conversion time              909.221 μs
  average bytes/code point               1.27611 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                 13870 B
  roundtrip conversion time              1457.82 μs
  average bytes/code point               1.27611 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                 11083 B  ( 79% of UTF-8)
  roundtrip conversion time              922.192 μs (101% of UTF-8)
  average bytes/code point               1.01969 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                 11083 B
  roundtrip conversion time              1376.38 μs
  average bytes/code point               1.01969 B/cp

# testing converter performance with file "russian.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       3863 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  6856 B
  roundtrip conversion time              554.667 μs
  average bytes/code point               1.77479 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  6856 B
  roundtrip conversion time              845.028 μs
  average bytes/code point               1.77479 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  4025 B  ( 58% of UTF-8)
  roundtrip conversion time              362.311 μs ( 65% of UTF-8)
  average bytes/code point               1.04194 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  4025 B
  roundtrip conversion time              539.687 μs
  average bytes/code point               1.04194 B/cp

# testing converter performance with file "s-chinese.html"
no Unicode signature - using UTF-8
number of code points                       7374 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  8882 B
  roundtrip conversion time              539.687 μs
  average bytes/code point                1.2045 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  8882 B
  roundtrip conversion time              872.007 μs
  average bytes/code point                1.2045 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  8358 B  ( 94% of UTF-8)
  roundtrip conversion time              870.857 μs (161% of UTF-8)
  average bytes/code point               1.13344 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  8358 B
  roundtrip conversion time              1233.37 μs
  average bytes/code point               1.13344 B/cp

# testing converter performance with file "s-chinese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       1056 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2546 B
  roundtrip conversion time              218.711 μs
  average bytes/code point               2.41098 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2546 B
  roundtrip conversion time              329.526 μs
  average bytes/code point               2.41098 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  1965 B  ( 77% of UTF-8)
  roundtrip conversion time              360.511 μs (164% of UTF-8)
  average bytes/code point                1.8608 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  1965 B
  roundtrip conversion time              469.584 μs
  average bytes/code point                1.8608 B/cp

# testing converter performance with file "t-chinese.html"
no Unicode signature - using UTF-8
number of code points                       7420 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  8946 B
  roundtrip conversion time              544.996 μs
  average bytes/code point               1.20566 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  8946 B
  roundtrip conversion time              886.394 μs
  average bytes/code point               1.20566 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  8410 B  ( 94% of UTF-8)
  roundtrip conversion time              877.587 μs (161% of UTF-8)
  average bytes/code point               1.13342 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  8410 B
  roundtrip conversion time              1248.76 μs
  average bytes/code point               1.13342 B/cp

# testing converter performance with file "t-chinese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                       1066 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  2564 B
  roundtrip conversion time              219.228 μs
  average bytes/code point               2.40525 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                  2564 B
  roundtrip conversion time              338.077 μs
  average bytes/code point               2.40525 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                  1979 B  ( 77% of UTF-8)
  roundtrip conversion time              364.965 μs (166% of UTF-8)
  average bytes/code point               1.85647 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                  1979 B
  roundtrip conversion time              479.381 μs
  average bytes/code point               1.85647 B/cp

# testing converter performance with file "th18057.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points                     135723 cp
platform endianness:                      little-endian

* performance report for                   UTF-8:
  intermediate buffer capacity              4096 B
  number of encoding bytes                334025 B
  roundtrip conversion time              31738.5 μs
  average bytes/code point               2.46108 B/cp

* performance report for                   UTF-8:
  intermediate buffer capacity                20 B
  number of encoding bytes                334025 B
  roundtrip conversion time              46659.1 μs
  average bytes/code point               2.46108 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity              4096 B
  number of encoding bytes                153648 B  ( 45% of UTF-8)
  roundtrip conversion time              19735.3 μs ( 62% of UTF-8)
  average bytes/code point               1.13207 B/cp

* performance report for                  BOCU-1:
  intermediate buffer capacity                20 B
  number of encoding bytes                153648 B
  roundtrip conversion time                26440 μs
  average bytes/code point               1.13207 B/cp
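
The derived figures in the reports above (average bytes/code point and the "% of UTF-8" columns) can be recomputed from the raw counts. The following is a minimal sketch in Python and is not part of the original test tool. The function names are hypothetical, and truncating the percentages instead of rounding them is an assumption inferred from the reported values (for example, 4025 B / 6856 B is about 58.7% and is reported as 58%).

```python
def bytes_per_code_point(encoding_bytes: int, code_points: int) -> float:
    """Average number of encoded bytes per Unicode code point."""
    return encoding_bytes / code_points


def percent_of_baseline(value: float, baseline: float) -> int:
    """Express value as a whole-number percentage of baseline.

    Assumption: the report truncates toward zero rather than rounding,
    which matches every percentage printed in the log above.
    """
    return int(100 * value / baseline)


# Raw figures copied from the "korean.txt" reports (4096 B buffer).
code_points = 1587
utf8_bytes, utf8_time_us = 3659, 328.427
bocu1_bytes, bocu1_time_us = 2753, 506.637

print(f"UTF-8  B/cp: {bytes_per_code_point(utf8_bytes, code_points):.5f}")
print(f"BOCU-1 B/cp: {bytes_per_code_point(bocu1_bytes, code_points):.5f}")
print(f"size: {percent_of_baseline(bocu1_bytes, utf8_bytes)}% of UTF-8")
print(f"time: {percent_of_baseline(bocu1_time_us, utf8_time_us)}% of UTF-8")
```

The same two functions reproduce the figures for the other files. Only the 4096 B reports carry the "% of UTF-8" annotations, so comparisons for the 20 B buffer runs have to be computed separately from the raw times.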

Copyright (c) 2002 International Business Machines Corporation and others. All Rights Reserved.

For the ICU license and other information please see the ICU homepage.