A MIME-compatible application of the Binary Ordered Compression for Unicode base algorithm.
Markus Scherer & Mark Davis
San José 2002-jan-29
This document describes a Unicode encoding for the storage and exchange of text data. It is stateful and provides a good byte/code point ratio while being directly usable in SMTP emails, database fields and other contexts.
Note that Unicode Technical Note #6 supersedes this design document.
The Technical Note is complete with sample code and a Windows executable,
but it does not contain the detailed analysis at the bottom of this document.
Markus 2002-dec-17
The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.
In some markets where scripts other than Latin are used, the high byte/code point ratio of UTF-8 (and of UTF-16 for scripts other than Latin and CJK ideographs) has been criticized and cited as a reason not to use Unicode for the exchange of documents in some languages. Larger document sizes are also a problem in low-bandwidth environments such as wireless networks for handheld systems.
SCSU was created as a Unicode compression scheme with a byte/code point ratio similar to language-specific codepages. It has not been widely adopted although it fulfills the criteria for an IANA charset and is registered as one. SCSU is not suitable for MIME "text" media types, i.e., it cannot be used directly in emails and similar protocols. SCSU requires a complicated encoder design for good performance.
BOCU-1 combines the wide applicability of UTF-8 with the compactness of SCSU. It is useful for short strings and maintains code point order.
BOCU-1 is a stateful, byte-oriented encoding of Unicode. Like SCSU, it can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a "charset" according to IANA, and it is suitable for MIME "text" media types.
BOCU-1 is more complicated than UTF-8 but much simpler than SCSU. Its inter-character state consists of a single integer. It is deterministic, i.e., the same complete input text is always encoded the same way by all encoders.
The byte/code point ratio is 1 for runs of code points from the same block of 0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with SCSU. This is much better than UTF-8 for Indic scripts, Russian, Greek, Arabic, Hebrew, etc. The startup overhead is very low (similar to SCSU), which makes it useful for very short strings like individual names. The maximum number of bytes per code point is four.
The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike SCSU — which allows the compression of large, sorted lists of strings.
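Because the byte order matches the code point order, sorted lists can be built and searched with a plain binary comparison, without decoding. The following is a minimal sketch of such a comparison; the function name `bocu1_compare` is hypothetical, and the sketch assumes that neither string contains an FF reset, since resets break the ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two BOCU-1 byte strings in binary order.
 * Returns <0, 0 or >0 like memcmp(); a proper prefix sorts first.
 * Assumption: both strings were encoded from the initial state and
 * contain no FF reset bytes used as commands. */
int bocu1_compare(const unsigned char *a, size_t alen,
                  const unsigned char *b, size_t blen) {
    size_t n = alen < blen ? alen : blen;
    int r;

    assert(a != NULL || alen == 0);
    assert(b != NULL || blen == 0);

    r = (n > 0) ? memcmp(a, b, n) : 0;
    if (r != 0) {
        return r;
    }
    /* Common prefix is equal: the shorter string sorts first. */
    return (alen > blen) - (alen < blen);
}
```

The same function can be passed (through a small wrapper) to `qsort` or `bsearch` when the strings are stored with their lengths.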
The C0 control codes NUL, CR and LF and nine others are encoded with the same byte values as in US-ASCII, and those byte values are used only for the encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME "text" media types, directly usable in emails and generally "friendly" for ASCII-based tools. The SUB control and its byte value 1A are included in this set to avoid problems on DOS/Windows/OS/2 systems, where 1A is interpreted as an end-of-file marker in text mode.
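A tool that needs to know which byte values are reserved in this way could use a small predicate. The sketch below assumes that the twelve values are 00, 07..0F, 1A and 1B; only NUL, CR, LF and SUB are named in this document, so the remaining members are an assumption that should be checked against the sample C code. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: is this byte value one of the twelve C0 control
 * bytes that BOCU-1 never uses as a lead or trail byte?
 * Assumed set: NUL (00), BEL..SI (07..0F), SUB (1A), ESC (1B). */
bool bocu1_is_protected_c0(unsigned int byte) {
    assert(byte <= 0xFF);
    return byte == 0x00 ||
           (byte >= 0x07 && byte <= 0x0F) ||
           byte == 0x1A ||
           byte == 0x1B;
}

/* Count the accepted byte values; with the assumed set this yields 12. */
int bocu1_count_protected_c0(void) {
    int count = 0;
    unsigned int b;
    for (b = 0; b <= 0xFF; ++b) {
        if (bocu1_is_protected_c0(b)) {
            ++count;
        }
    }
    return count;
}
```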
The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other's encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike SCSU). This also allows some limited random access.
Byte values for single-byte codes and lead bytes overlap with trail bytes. So unlike UTF-8, character boundaries cannot be determined in random access, except by backing up to a reset point.
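Backing up to a reset point can be sketched as a backward scan for a byte value that is never used as a trail byte. The set of such values (00, 07..0F, 1A, 1B) is an assumption here, as are the function names; FF is deliberately not treated as a reset point because FF is also a normal trail byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte values assumed never to occur as trail bytes,
 * so any occurrence is a real C0 control and therefore a state reset. */
static int bocu1_is_reset_byte(unsigned char b) {
    return b == 0x00 || (b >= 0x07 && b <= 0x0F) || b == 0x1A || b == 0x1B;
}

/* Return the offset at which decoding can safely start in order to reach
 * position pos: the offset just after the nearest reset byte before pos,
 * or 0 (start of the stream, initial state) if there is none. */
size_t bocu1_find_reset_point(const unsigned char *s, size_t pos) {
    assert(s != NULL);
    while (pos > 0) {
        if (bocu1_is_reset_byte(s[pos - 1])) {
            return pos;
        }
        --pos;
    }
    return 0;
}
```

A decoder would start at the returned offset with the initial state and decode forward until it reaches the position of interest.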
Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes.
US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.
As a single/lead byte, byte value FF is used as a special "reset-state-only" command.
It does not encode any code point. FF is also a normal trail byte.
Having a "reset only" command allows simple concatenation of BOCU-1 byte streams.
(All other BOCU-1 byte sequences would append some code point.)
Using FF to reset the state breaks the ordering!
The use of FF resets is therefore discouraged.
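For cases where the loss of ordering is acceptable, simple concatenation with an FF reset could look like the following sketch. It assumes that both inputs are complete BOCU-1 streams (each ends on a character boundary) and that the second one was encoded from the initial state; the names `BOCU1_RESET_BYTE` and `bocu1_concat_with_reset` are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOCU1_RESET_BYTE 0xFF  /* "reset-state-only" command as a lead byte */

/* Hypothetical helper: write a, then FF, then b into dst.
 * Returns the number of bytes written, or 0 if dst is too small. */
size_t bocu1_concat_with_reset(unsigned char *dst, size_t cap,
                               const unsigned char *a, size_t alen,
                               const unsigned char *b, size_t blen) {
    size_t need = alen + 1 + blen;

    assert(dst != NULL);
    if (need > cap) {
        return 0;
    }
    if (alen > 0) {
        memcpy(dst, a, alen);
    }
    /* After a complete stream, FF is read as a lead byte, i.e. a reset. */
    dst[alen] = BOCU1_RESET_BYTE;
    if (blen > 0) {
        memcpy(dst + alen + 1, b, blen);
    }
    return need;
}
```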
Byte stream concatenation without an FF reset requires scanning back to
a C0 control whose byte value is not used for trail bytes
(the last known reset to the initial state),
then decoding to the end of the first stream and encoding the first non-U+0020 code point
of the second stream according to the resulting state,
and finally appending the rest of the second stream.
The same procedure could be used to remove an FF reset command.
An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
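A receiver that relies on the signature could test for these three bytes at the start of the data. This is a minimal sketch; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the buffer start with the BOCU-1 encoding of
 * an initial U+FEFF, i.e. the bytes FB EE 28? */
bool bocu1_has_signature(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    return len >= 3 && s[0] == 0xFB && s[1] == 0xEE && s[2] == 0x28;
}
```

A caller would typically skip the three bytes when the function returns true and then decode the remainder as BOCU-1.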
The basic algorithm is as described in Binary Ordered Compression for Unicode.
BOCU-1 differs from the generic algorithm by using a different set of byte value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.
Partial pseudo-code for a per-code point encoding function is as follows:
    encode(int &prev, int c) {
        if(c<=0x20) {
            output (byte)c;
            if(c!=0x20) {
                prev=0x40;
            }
        } else {
            int diff=c-prev;
            // encode diff in 1..4 bytes and output them

            // adjust prev
            if(c is Hiragana) {
                prev=middle of Hiragana;
            } else if(c is CJK Unihan) {
                prev=middle of CJK Unihan;
            } else if(c is Hangul) {
                prev=middle of Hangul;
            } else {
                prev=(c&~0x7f)+0x40;
            }
        }
    }
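The state handling in the pseudo-code can be made concrete without the byte-level encoding of the differences. The sketch below only computes the difference value for each code point above U+0020 and updates the state. The block ranges and the plain midpoints used for Hiragana, CJK Unihan and Hangul are assumptions for illustration; the sample C code may use other values and remains the specification. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOCU1_SKETCH_INITIAL_PREV 0x40  /* state after a reset, per the pseudo-code */

/* Compute the next state value after encoding code point c (c > 0x20).
 * Ranges and midpoints for the three large scripts are assumed. */
int32_t bocu1_next_prev(int32_t c) {
    if (c >= 0x3040 && c <= 0x309F) {
        return 0x3070;                    /* Hiragana, not 128-aligned */
    } else if (c >= 0x4E00 && c <= 0x9FA5) {
        return (0x4E00 + 0x9FA5) / 2;     /* CJK Unihan, illustrative midpoint */
    } else if (c >= 0xAC00 && c <= 0xD7A3) {
        return (0xAC00 + 0xD7A3) / 2;     /* Hangul, illustrative midpoint */
    }
    return (c & ~0x7F) + 0x40;            /* middle of the 128-code-point block */
}

/* Store one difference per code point above U+0020 in diffs[] (which must
 * have room for n entries) and return how many were stored. */
size_t bocu1_diffs(const int32_t *cps, size_t n, int32_t *diffs) {
    int32_t prev = BOCU1_SKETCH_INITIAL_PREV;
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        int32_t c = cps[i];
        assert(c >= 0 && c <= 0x10FFFF);
        if (c <= 0x20) {
            /* Encoded directly; C0 controls reset the state, space does not. */
            if (c != 0x20) {
                prev = BOCU1_SKETCH_INITIAL_PREV;
            }
        } else {
            diffs[count++] = c - prev;
            prev = bocu1_next_prev(c);
        }
    }
    return count;
}
```

For the Cyrillic text U+0434 U+0430 U+0020 U+043D U+0435 U+0442, only the first letter produces a large difference (0x3F4); the letters after the space still yield small differences (-3, -11, 2) because the space leaves the state at 0x440.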
The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences. Recovery from illegal input is not specified. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in UTF-16). Well-formed supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded as whole code points.
This code uses ICU standard
headers and the one implementation file icu/source/common/utf_impl.c.
(It is not necessary to link the entire ICU common library.) This is for
convenience in the surrounding test functions and not necessary for the core
BOCU-1 functions. These headers and implementation file provide the following:
This code is under the X license (ICU version).
Files:
main() function, see below)
A complete, compiled sample executable for Windows from this source code is available for download. See Unicode Technical Note #6 and the note at the top of this document. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of "?" or "-h" for usage.
2002-apr-12: Use single/lead byte value FF as "reset only" without code point output. FF is still also used as a trail byte.
2002-mar-20: Remove resetting at NL, LS, PS (see change from feb21). Reason: Resetting for them makes the algorithm more complicated without much gain. These characters are rarely used and their BOCU encoding still depends on the preceding code point, unlike with LF etc. The only gain would be when NL/LS/PS were used with US-ASCII text, where the first character on a line would need one byte instead of two.
2002-feb-21: Change adjustment of prev to account for Hiragana (not 128-aligned) and to reset at U+0085 (Newline), U+2028 (Line Separator) and U+2029 (Paragraph Separator).
Markus Scherer
2002apr03
This document provides a performance comparison between BOCU-1, UTF-8, and SCSU when implemented as ICU converters, which convert to and from UTF-16.
Values are relative to UTF-8.
| Languages | BOCU-1: size of text | BOCU-1: time to convert to/from UTF-16 | SCSU: size of text | SCSU: time to convert to/from UTF-16 |
|---|---|---|---|---|
| English/French | 100% | 160..170% | 100% | 125% |
| Greek/Russian/Arabic/Hebrew | 60% | 65..70% | 55% | 70% |
| Hindi | 40% | 45% | 40% | 45% |
| Thai (see below) | 45% | 60% | 40% | 55% |
| Japanese | 60% | 150% | 55% | 110% |
| Korean | 75% | 155% | 85% | 70% |
| Chinese | 75% | 165% | 70% | 65% |
(Smaller percentages are better. Percentages are rounded to the nearest 5%.)
Compression is less effective for web pages because HTML markup contains a lot of ASCII. The performance difference tends to be smaller for smaller buffers. When the text is transmitted between machines (emails, web pages), the transmission time may swamp the conversion time, and smaller text will then transmit faster.
The texts are the "What is Unicode" pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar.
The Thai text, th18057.txt, has a different structure: It is a Thai word list from ICU's test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF.
This comparison uses full-fledged ICU converters for UTF-8, SCSU and BOCU-1.
"Full-fledged ICU converter" means that this is with the ICU conversion API,
designed for external encodings, as used e.g. by an XML parser or web browser.
There are also ICU functions for transformations of in-process strings
between UTF-8/16/32 that have a little less overhead
(because they do not handle split-buffer conversions and customizable error handling).
Something like that could be written for BOCU-1;
I guess that the relative performance based on this would be similar to these numbers.
The test outputs the number of encoding bytes and the time to roundtrip-convert from the internal UTF-16 form to the encoding and back. The test machine is an IBM Thinkpad 570 (Pentium 2/366) with 192MB of RAM.
The ICU converter code for SCSU and BOCU-1 that is tested here is not currently part of the ICU CVS. The SCSU converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. Code very similar to this may become part of ICU in 2002.
The output shows results for multiple files, for UTF-8, SCSU and BOCU-1 for each file. For each file and encoding there are two sets of numbers, one measured with a large intermediate buffer and one measured with a small intermediate buffer. In each case, roundtrip conversions are performed for 2 seconds and the total time is divided by the number of roundtrips.
# testing converter performance with file "arabic.html"
no Unicode signature - using UTF-8
number of code points 9402 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 11192 B
  roundtrip conversion time 709.154 μs
  average bytes/code point 1.19039 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 11192 B
  roundtrip conversion time 1116.58 μs
  average bytes/code point 1.19039 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 9534 B ( 85% of UTF-8)
  roundtrip conversion time 778.212 μs (109% of UTF-8)
  average bytes/code point 1.01404 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 9534 B
  roundtrip conversion time 1164.65 μs
  average bytes/code point 1.01404 B/cp

# testing converter performance with file "arabic.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 2270 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 4035 B
  roundtrip conversion time 337.073 μs
  average bytes/code point 1.77753 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 4035 B
  roundtrip conversion time 508.123 μs
  average bytes/code point 1.77753 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2375 B ( 58% of UTF-8)
  roundtrip conversion time 220.328 μs ( 65% of UTF-8)
  average bytes/code point 1.04626 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 2375 B
  roundtrip conversion time 323.184 μs
  average bytes/code point 1.04626 B/cp

# testing converter performance with file "english.html"
no Unicode signature - using UTF-8
number of code points 14502 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 14695 B
  roundtrip conversion time 755.631 μs
  average bytes/code point 1.01331 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 14695 B
  roundtrip conversion time 1259.7 μs
  average bytes/code point 1.01331 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 14590 B ( 99% of UTF-8)
  roundtrip conversion time 1150.54 μs (152% of UTF-8)
  average bytes/code point 1.00607 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 14590 B
  roundtrip conversion time 1713.43 μs
  average bytes/code point 1.00607 B/cp

# testing converter performance with file "english.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 2975 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2975 B
  roundtrip conversion time 148.619 μs
  average bytes/code point 1 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 2975 B
  roundtrip conversion time 251.223 μs
  average bytes/code point 1 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2975 B (100% of UTF-8)
  roundtrip conversion time 242.559 μs (163% of UTF-8)
  average bytes/code point 1 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 2975 B
  roundtrip conversion time 359.415 μs
  average bytes/code point 1 B/cp

# testing converter performance with file "french.html"
no Unicode signature - using UTF-8
number of code points 10432 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10506 B
  roundtrip conversion time 536.455 μs
  average bytes/code point 1.00709 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 10506 B
  roundtrip conversion time 897.748 μs
  average bytes/code point 1.00709 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10578 B (100% of UTF-8)
  roundtrip conversion time 846.154 μs (157% of UTF-8)
  average bytes/code point 1.014 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 10578 B
  roundtrip conversion time 1263.93 μs
  average bytes/code point 1.014 B/cp

# testing converter performance with file "french.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 3415 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 3488 B
  roundtrip conversion time 188.281 μs
  average bytes/code point 1.02138 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 3488 B
  roundtrip conversion time 305.797 μs
  average bytes/code point 1.02138 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 3559 B (102% of UTF-8)
  roundtrip conversion time 319.852 μs (169% of UTF-8)
  average bytes/code point 1.04217 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 3559 B
  roundtrip conversion time 470.63 μs
  average bytes/code point 1.04217 B/cp

# testing converter performance with file "greek.html"
no Unicode signature - using UTF-8
number of code points 10544 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 13292 B
  roundtrip conversion time 878.842 μs
  average bytes/code point 1.26062 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 13292 B
  roundtrip conversion time 1386.92 μs
  average bytes/code point 1.26062 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10760 B ( 80% of UTF-8)
  roundtrip conversion time 901.917 μs (102% of UTF-8)
  average bytes/code point 1.02049 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 10760 B
  roundtrip conversion time 1340.21 μs
  average bytes/code point 1.02049 B/cp

# testing converter performance with file "greek.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 3574 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 6313 B
  roundtrip conversion time 516.541 μs
  average bytes/code point 1.76637 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 6313 B
  roundtrip conversion time 782.116 μs
  average bytes/code point 1.76637 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 3736 B ( 59% of UTF-8)
  roundtrip conversion time 343.872 μs ( 66% of UTF-8)
  average bytes/code point 1.04533 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 3736 B
  roundtrip conversion time 504.132 μs
  average bytes/code point 1.04533 B/cp

# testing converter performance with file "hebrew.html"
no Unicode signature - using UTF-8
number of code points 9168 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10953 B
  roundtrip conversion time 702.245 μs
  average bytes/code point 1.1947 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 10953 B
  roundtrip conversion time 1102.28 μs
  average bytes/code point 1.1947 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 9375 B ( 85% of UTF-8)
  roundtrip conversion time 784.104 μs (111% of UTF-8)
  average bytes/code point 1.02258 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 9375 B
  roundtrip conversion time 1167.55 μs
  average bytes/code point 1.02258 B/cp

# testing converter performance with file "hebrew.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 2378 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 4150 B
  roundtrip conversion time 354.799 μs
  average bytes/code point 1.74516 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 4150 B
  roundtrip conversion time 525.437 μs
  average bytes/code point 1.74516 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2544 B ( 61% of UTF-8)
  roundtrip conversion time 244.949 μs ( 69% of UTF-8)
  average bytes/code point 1.06981 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 2544 B
  roundtrip conversion time 359.682 μs
  average bytes/code point 1.06981 B/cp

# testing converter performance with file "hindi.html"
no Unicode signature - using UTF-8
number of code points 9813 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 14549 B
  roundtrip conversion time 1046.09 μs
  average bytes/code point 1.48263 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 14549 B
  roundtrip conversion time 1656.64 μs
  average bytes/code point 1.48263 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10033 B ( 68% of UTF-8)
  roundtrip conversion time 851.978 μs ( 81% of UTF-8)
  average bytes/code point 1.02242 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 10033 B
  roundtrip conversion time 1263.72 μs
  average bytes/code point 1.02242 B/cp

# testing converter performance with file "hindi.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 2993 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 7703 B
  roundtrip conversion time 689.618 μs
  average bytes/code point 2.57367 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 7703 B
  roundtrip conversion time 1067.52 μs
  average bytes/code point 2.57367 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 3149 B ( 40% of UTF-8)
  roundtrip conversion time 298.935 μs ( 43% of UTF-8)
  average bytes/code point 1.05212 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 3149 B
  roundtrip conversion time 435.626 μs
  average bytes/code point 1.05212 B/cp

# testing converter performance with file "japanese.html"
no Unicode signature - using UTF-8
number of code points 7890 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10726 B
  roundtrip conversion time 702.417 μs
  average bytes/code point 1.35944 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 10726 B
  roundtrip conversion time 1112.17 μs
  average bytes/code point 1.35944 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 9015 B ( 84% of UTF-8)
  roundtrip conversion time 1077.62 μs (153% of UTF-8)
  average bytes/code point 1.14259 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 9015 B
  roundtrip conversion time 1464.18 μs
  average bytes/code point 1.14259 B/cp

# testing converter performance with file "japanese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 1584 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 4400 B
  roundtrip conversion time 380.82 μs
  average bytes/code point 2.77778 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 4400 B
  roundtrip conversion time 580.748 μs
  average bytes/code point 2.77778 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2648 B ( 60% of UTF-8)
  roundtrip conversion time 567.302 μs (148% of UTF-8)
  average bytes/code point 1.67172 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 2648 B
  roundtrip conversion time 691.883 μs
  average bytes/code point 1.67172 B/cp

# testing converter performance with file "korean.html"
no Unicode signature - using UTF-8
number of code points 8287 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 10401 B
  roundtrip conversion time 681.758 μs
  average bytes/code point 1.2551 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 10401 B
  roundtrip conversion time 1076.97 μs
  average bytes/code point 1.2551 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 9560 B ( 91% of UTF-8)
  roundtrip conversion time 1062.04 μs (155% of UTF-8)
  average bytes/code point 1.15361 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 9560 B
  roundtrip conversion time 1494.03 μs
  average bytes/code point 1.15361 B/cp

# testing converter performance with file "korean.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 1587 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 3659 B
  roundtrip conversion time 328.427 μs
  average bytes/code point 2.30561 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 3659 B
  roundtrip conversion time 500.49 μs
  average bytes/code point 2.30561 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2753 B ( 75% of UTF-8)
  roundtrip conversion time 506.637 μs (154% of UTF-8)
  average bytes/code point 1.73472 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 2753 B
  roundtrip conversion time 655.06 μs
  average bytes/code point 1.73472 B/cp

# testing converter performance with file "russian.html"
no Unicode signature - using UTF-8
number of code points 10869 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 13870 B
  roundtrip conversion time 909.221 μs
  average bytes/code point 1.27611 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 13870 B
  roundtrip conversion time 1457.82 μs
  average bytes/code point 1.27611 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 11083 B ( 79% of UTF-8)
  roundtrip conversion time 922.192 μs (101% of UTF-8)
  average bytes/code point 1.01969 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 11083 B
  roundtrip conversion time 1376.38 μs
  average bytes/code point 1.01969 B/cp

# testing converter performance with file "russian.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 3863 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 6856 B
  roundtrip conversion time 554.667 μs
  average bytes/code point 1.77479 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 6856 B
  roundtrip conversion time 845.028 μs
  average bytes/code point 1.77479 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 4025 B ( 58% of UTF-8)
  roundtrip conversion time 362.311 μs ( 65% of UTF-8)
  average bytes/code point 1.04194 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 4025 B
  roundtrip conversion time 539.687 μs
  average bytes/code point 1.04194 B/cp

# testing converter performance with file "s-chinese.html"
no Unicode signature - using UTF-8
number of code points 7374 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 8882 B
  roundtrip conversion time 539.687 μs
  average bytes/code point 1.2045 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 8882 B
  roundtrip conversion time 872.007 μs
  average bytes/code point 1.2045 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 8358 B ( 94% of UTF-8)
  roundtrip conversion time 870.857 μs (161% of UTF-8)
  average bytes/code point 1.13344 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 8358 B
  roundtrip conversion time 1233.37 μs
  average bytes/code point 1.13344 B/cp

# testing converter performance with file "s-chinese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 1056 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2546 B
  roundtrip conversion time 218.711 μs
  average bytes/code point 2.41098 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 2546 B
  roundtrip conversion time 329.526 μs
  average bytes/code point 2.41098 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 1965 B ( 77% of UTF-8)
  roundtrip conversion time 360.511 μs (164% of UTF-8)
  average bytes/code point 1.8608 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 1965 B
  roundtrip conversion time 469.584 μs
  average bytes/code point 1.8608 B/cp

# testing converter performance with file "t-chinese.html"
no Unicode signature - using UTF-8
number of code points 7420 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 8946 B
  roundtrip conversion time 544.996 μs
  average bytes/code point 1.20566 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 8946 B
  roundtrip conversion time 886.394 μs
  average bytes/code point 1.20566 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 8410 B ( 94% of UTF-8)
  roundtrip conversion time 877.587 μs (161% of UTF-8)
  average bytes/code point 1.13342 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 8410 B
  roundtrip conversion time 1248.76 μs
  average bytes/code point 1.13342 B/cp

# testing converter performance with file "t-chinese.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 1066 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 2564 B
  roundtrip conversion time 219.228 μs
  average bytes/code point 2.40525 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 2564 B
  roundtrip conversion time 338.077 μs
  average bytes/code point 2.40525 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 1979 B ( 77% of UTF-8)
  roundtrip conversion time 364.965 μs (166% of UTF-8)
  average bytes/code point 1.85647 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 1979 B
  roundtrip conversion time 479.381 μs
  average bytes/code point 1.85647 B/cp

# testing converter performance with file "th18057.txt"
detected signature for UTF-8 (removing 3 bytes)
number of code points 135723 cp
platform endianness: little-endian
* performance report for UTF-8:
  intermediate buffer capacity 4096 B
  number of encoding bytes 334025 B
  roundtrip conversion time 31738.5 μs
  average bytes/code point 2.46108 B/cp
* performance report for UTF-8:
  intermediate buffer capacity 20 B
  number of encoding bytes 334025 B
  roundtrip conversion time 46659.1 μs
  average bytes/code point 2.46108 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 4096 B
  number of encoding bytes 153648 B ( 45% of UTF-8)
  roundtrip conversion time 19735.3 μs ( 62% of UTF-8)
  average bytes/code point 1.13207 B/cp
* performance report for BOCU-1:
  intermediate buffer capacity 20 B
  number of encoding bytes 153648 B
  roundtrip conversion time 26440 μs
  average bytes/code point 1.13207 B/cp
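The derived figures in the report (average bytes/code point and the percentages relative to UTF-8) can be recomputed from the raw counts. The sketch below is not taken from the test program; the function names are hypothetical, and the choice of truncating rather than rounding the percentage is an inference from the figures above (for example, 10760 B against 13292 B for greek.html is about 80.95% and is reported as 80%).

```c
#include <assert.h>

/* Average number of encoding bytes per code point. */
double bytes_per_code_point(long encoding_bytes, long code_points) {
    assert(code_points > 0);
    return (double)encoding_bytes / (double)code_points;
}

/* Whole-number percentage of value relative to base, truncated toward
 * zero (assumed to match the report's formatting). Works for both byte
 * counts and conversion times. */
int percent_of(double value, double base) {
    assert(base > 0.0);
    return (int)(value * 100.0 / base);
}
```

For arabic.html, `percent_of(9534, 11192)` gives 85 and `percent_of(778.212, 709.154)` gives 109, matching the BOCU-1 report for the 4096 B buffer.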
Copyright (c) 2002 International Business Machines Corporation and others. All Rights Reserved.
For the ICU license and other information please see the ICU homepage.