Goal: take a block of bytes (2 B to x MB) and determine, with some certainty, its text encoding and main language. Assume that the input is real text in a written language, not random bytes, "gibberish", or a binary format.
The general idea is to combine multiple methods, each designed to work well on part of the problem; only the combination is expected to work well on "all" texts. In principle, go from simple, fast methods to more complicated ones as necessary. Some algorithms can run in parallel, with the input bytes fed to multiple "engines", or what Shanjian calls Parallel State Machines.
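A minimal sketch of the "multiple engines" idea, assuming two toy engines. The class names `AsciiMachine` and `Utf8Machine` and the function `surviving_encodings` are hypothetical and are not taken from the Netscape code. Each engine sees every input byte and drops out as soon as the byte stream becomes illegal for its encoding. The UTF-8 engine is deliberately simplified: it checks lead and continuation byte ranges only and does not reject every overlong or surrogate sequence.

```python
class AsciiMachine:
    """Stays alive only while every byte is below 0x80."""
    name = "US-ASCII"

    def __init__(self):
        self.alive = True

    def feed(self, byte):
        if byte >= 0x80:
            self.alive = False

    def finish(self):
        return self.alive


class Utf8Machine:
    """Simplified UTF-8 structure check (lead/continuation ranges only)."""
    name = "UTF-8"

    def __init__(self):
        self.alive = True
        self.pending = 0  # continuation bytes still expected

    def feed(self, byte):
        if not self.alive:
            return
        if self.pending:
            if 0x80 <= byte <= 0xBF:
                self.pending -= 1
            else:
                self.alive = False
        elif 0xC2 <= byte <= 0xDF:
            self.pending = 1
        elif 0xE0 <= byte <= 0xEF:
            self.pending = 2
        elif 0xF0 <= byte <= 0xF4:
            self.pending = 3
        elif byte >= 0x80:
            self.alive = False  # stray continuation or illegal lead byte

    def finish(self):
        # A sequence cut off at end of input is also an error.
        return self.alive and self.pending == 0


def surviving_encodings(data, machines):
    """Feed every byte to every engine; report the engines still alive."""
    for byte in data:
        for machine in machines:
            machine.feed(byte)
    return [m.name for m in machines if m.finish()]
```

Engines that survive only narrow the candidate set; a later, more expensive stage (for example frequency analysis) would still have to choose among them.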
Existing methods appear to be limited to the use cases of their applications. For example, the Netscape 6.1 charset detection concentrates on emails and web pages, and seems to ignore UTF-32, EBCDIC, and SCSU.
Netscape 6.1 implements charset detection with a mixture of Parallel State Machines for encoding detection, UTF-16 BOM testing, and byte/di-byte frequency analysis. See the presentation at IUC 19, B6, A Composite Approach to Language/Encoding Detection. See also Mozilla bug number 33337.
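BOM testing is the cheapest of these checks. The following is a sketch only, using the BOM constants from Python's standard `codecs` module; the function name `sniff_bom` is hypothetical, and covering UTF-32 and UTF-8 in addition to UTF-16 is an assumption that goes beyond what the Netscape detector is described as doing.

```python
import codecs

# Order matters: the UTF-32LE BOM (FF FE 00 00) starts with the
# UTF-16LE BOM (FF FE), so the longer signatures are tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A `None` result says nothing about the encoding; it only means the later, more expensive methods have to run. Note also that FF FE 00 00 could in principle be UTF-16LE text that begins with U+0000, which is why this result should be treated as a strong hint rather than proof.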
Discussion about charset detection on the www-international@w3.org mailing list, snapshot from 2001-sep-06:
----- Original Message -----
From: "Thierry Sourbier" <webmaster@i18ngurus.com>
To: "W3intl (E-mail)" <www-international@w3.org>
Sent: Thursday, September 06, 2001 21:17
Subject: Re: auto-detecting the character encoding of an uploaded file

> > Yes. Shift_JIS can have bytes in the 0x80-0x9F range, but Latin-1 doesn't.
> > If there is such a byte, Shift_JIS cannot be misinterpreted as Latin-1.
> > It may be misinterpreted as windows-1252, but that's a different story.
>
> The value of one byte may indeed not be enough to differentiate between
> various encodings, but for most european languages it is fairly rare to have
> consecutive extended characters (by extended I mean with a code value above
> 127). Therefore a Shift-JIS encoded Japanese text and a European
> windows-1252 one are fairly easy to differentiate when you look at the
> entire stream.
>
> Lenny, you might want to have a look at TextCat. This little tool (which
> source code is available) helps you recognize 69 languages and encoding
> combinations. It probably can easily be extended to more. Byte pattern
> analysis is probably the most generic way to go, it gives great result even
> on fairly small text size.
>
> Text cat can be found at:
> http://odur.let.rug.nl/~vannoord/TextCat/
>
> More links on languages indentification tools & techniques can be found at:
> http://www.i18ngurus.com/docs/998504805.html
>
> Cheers,
> Thierry
>
> <><><><><><><><><><><><><><><><><><><><><><>
> www.i18ngurus.com - Open Internationalization Resources Directory
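Thierry's observation about consecutive extended characters can be turned into a simple measurement. This is a sketch under assumptions: the function name `paired_high_byte_ratio` is hypothetical, and any threshold for calling a text "European single-byte" or "double-byte" would have to be tuned on real data.

```python
def paired_high_byte_ratio(data):
    """Fraction of bytes >= 0x80 that sit next to another byte >= 0x80.

    Returns 0.0 when the data contains no such bytes at all.
    """
    high = [b >= 0x80 for b in data]
    total = sum(high)
    if total == 0:
        return 0.0
    paired = 0
    for i, is_high in enumerate(high):
        if not is_high:
            continue
        prev_high = i > 0 and high[i - 1]
        next_high = i + 1 < len(high) and high[i + 1]
        if prev_high or next_high:
            paired += 1
    return paired / total
```

For example, a Shift_JIS encoding of a hiragana phrase gives a ratio close to 1, while a windows-1252 encoding of a French sentence with isolated accented letters gives a ratio of 0. The ratio by itself cannot tell Shift_JIS from other multi-byte encodings; it only separates the two families discussed in the message.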
----- Message from Merle Tenney <Merle.Tenney@corp.palm.com> on Thu, 6 Sep 2001 14:20:51 -0700 -----
To: "'Bob Jung'" <bobj@netscape.com>, Martin Duerst <duerst@w3.org>
cc: vinod@filemaker.com, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>, Shanjian Li <shanjian@netscape.com>, momoi@netscape.com
Subject: RE: auto-detecting the character encoding of an uploaded file
Thanks, Bob, for the reference to your team's upcoming paper (and hi, by the way).

Most of this discussion has focused on subtle differences in legal codepoints in various encodings and legal patterns of bytes in encodings. There is another approach, though, which is much more effective and gives you valuable additional information to boot. That is a system based on relative byte and n-gram frequencies, which have characteristic patterns for a given pair of language and encoding. So in English, for example, "e", "th", "tion", and " the " are quite common, whereas in Spanish "os ", "ción", and " un " are quite common. In different encodings, say Windows 1252, MacRoman, UTF-8, UTF-16BE, and UTF-16LE, this will translate into different relative n-gram frequencies.

The way the system works is that first a development version is exposed to a reasonable corpus of a particular language in a particular encoding. For some reason, 100K words seems to stand out in my mind. The system then empirically calculates the frequencies for this corpus, and stores the salient frequencies in a table. And the amazing thing is that it is fast and accurate, and it requires zero intervention by encoding specialists or linguists. This table is packaged up with the tables for the other languages and encodings previously profiled and these tables are included with the auto-detection software, which is normally bundled with a browser, search engine, word processor, etc.

In actual use, a passage of user text has *its* n-gram frequencies calculated on the fly, and these are compared to the stored profiles, and a match is made for the profile that most closely matches the sample. In practice, it is surprisingly good and works on a surprisingly short text sample. You can usually determine the language and encoding of the sample with near certainty in a line or two of text.
To the best of my knowledge, this approach was first proposed and developed by Ken Beesley in the late 80s while he was at ALP Systems. Here is a reference to that early work:
http://www.xrce.xerox.com/people/beesley/langid.html

Ken subsequently joined Xerox PARC and then XRCE in Grenoble, where the work was picked up. It was later commercialized by Xerox's spin-off InXight as part of their LinguistX Platform product:
http://www.inxight.com/products_sp/linguistx/index.html

I know that Microsoft has developed a similar technology, which is shown off quite well in their multilingual spelling checking in Word. However, I don't think it is available to developers outside of Microsoft.

Inso also had a competitive technology, which they sold to Lernout & Hauspie. It is now called the IntelliScope Language Recognizer, and it is part of their IntelliScope Retrieval Toolkit, described here:
http://www.lhsl.com/tech/icm/retrieval/toolkit/lr.asp

I can't tell from your brief description, Bob, if n-gram frequencies (under a different name) are part of your Mozilla work or not. If they're not, they should be. :-)

The bottom line, folks, is that there are a lot better technologies available which allow you to automatically detect encodings, and they come with the tremendous additional benefit of being able to identify the language as well. We can all imagine lots of ways we could use that information. Maybe some of you will start sniffing down a different trail for a solution here....
Merle

> -----Original Message-----
> From: bobj@netscape.com [mailto:bobj@netscape.com]
> Sent: Thursday, September 06, 2001 8:11 PM
> To: Martin Duerst
> Cc: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail); Shanjian Li;
> momoi@netscape.com
> Subject: Re: auto-detecting the character encoding of an uploaded file
>
> FYI, there will be a paper presented at the Nineteenth International
> Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in San
> Jose, California :
>
> A Composite Approach to Language/Encoding Detection
> by Shanjian Li & Katsuhiko Momoi - Netscape Communications
>
> Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm
>
> Abstract: http://www.unicode.org/iuc/iuc19/a322.html
>
> And since this is part of Mozilla, it is all open source!
>
> http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
>
> -Bob
>
> Martin Duerst wrote:
>
> > At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
> >
> >> Lenny ,
> >>
> >> Just some thoughts.
> >>
> >> Since you have mentioned Shift-JIS,
> >
> > As a charset, spelled shift_jis (case doesn't matter, but the
> > underscore does).
> >
> >> there is no guarantee that every other
> >> byte in UTF-16 is zero especially for non-us systems like
> >> Japanese/European
> >> .
> >
> > No. But if you see even a single zero byte, then the chance that the
> > document is in UTF-16 is very high.
> >
> >> Also there is no significance for BOM for UTF-8, which means not all
> >> applications will add a BOM for the UTF-8 text.
> >
> > Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
> > discouraged. Detecting UTF-8 is easy enough without a BOM.
> >
> >> Finally, I don't think we
> >> can come up with an auto-detect algorithm for detecting
> >> Latin-1/UTF-*/Shift-JIS.
> >
> > For all these, it's not too difficult. Shift-JIS uses bytes in
> > the 0x80-0x9F range, and has specific patterns. If there are
> > only very few characters outside us-ascii, it may not work,
> > but with more non-us-ascii characters, the probability
> > of success is going up very quickly.
> >
> > Regards, Martin.
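The profile-and-compare scheme that Merle Tenney describes in this message can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the two one-sentence toy "corpora" (the message suggests something on the order of 100K words), the restriction to byte bigrams and trigrams, and the use of cosine similarity as the comparison measure. The message does not say which measure the commercial systems use.

```python
from collections import Counter
from math import sqrt


def ngram_profile(data, sizes=(2, 3)):
    """Count byte n-grams of the given sizes in a bytes object."""
    counts = Counter()
    for n in sizes:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count tables."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_profiles(corpora, encodings):
    """One profile per (language, encoding) pair, built from training text."""
    return {
        (lang, enc): ngram_profile(text.encode(enc))
        for lang, text in corpora.items()
        for enc in encodings
    }


def best_match(sample, profiles):
    """Return the (language, encoding) key whose profile is closest to the sample."""
    sample_profile = ngram_profile(sample)
    return max(profiles, key=lambda key: cosine(sample_profile, profiles[key]))


# Toy training data, far too small for real use.
CORPORA = {
    "English": "the quick brown fox jumps over the lazy dog and then "
               "the nation waits for the station to open",
    "Spanish": "el perro come los huesos y los gatos duermen en la "
               "estación de un pueblo con una canción",
}
PROFILES = build_profiles(CORPORA, ["cp1252", "utf-8"])
```

Calling `best_match(text_bytes, PROFILES)` returns a `(language, encoding)` pair. With pure ASCII input the two encodings of a language produce identical profiles, so only the language part of the answer is meaningful in that case.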
----- Message from vinod_balakrishnan@filemaker.com (Vinod Balakrishnan) on Thu, 6 Sep 2001 10:40:14 -0700 -----
To: "Martin Duerst" <duerst@w3.org>, "Lenny Turetsky" <LTuretsky@salesforce.com>, "W3intl \(E-mail\)" <www-international@w3.org>
Subject: RE: auto-detecting the character encoding of an uploaded file
>For all these, it's not too difficult. Shift-JIS uses bytes in
>the 0x80-0x9F range, and has specific patterns. If there are
>only very few characters outside us-ascii, it may not work,
>but with more non-us-ascii characters, the probability
>of success is going up very quickly.

Shift-JIS represent the trailing bytes of double byte in 0x40-0xF0 range ( only the leading byte is in high ASCII range ) . Also the Hankaku (single byte) Kana is represented as single byte in the high ASCII range. A Japanese text in Shift-JIS contains single byte kana and Kanji characters can be misinterpreted as Latin-1

-Vinod

-----Original Message-----
From: Martin Duerst [mailto:duerst@w3.org]
Sent: Wednesday, September 05, 2001 6:36 PM
To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
Subject: RE: auto-detecting the character encoding of an uploaded file

At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
>Lenny ,
>
>Just some thoughts.
>
>Since you have mentioned Shift-JIS,

As a charset, spelled shift_jis (case doesn't matter, but the underscore does).

>there is no guarantee that every other
>byte in UTF-16 is zero especially for non-us systems like Japanese/European
>.

No. But if you see even a single zero byte, then the chance that the document is in UTF-16 is very high.

>Also there is no significance for BOM for UTF-8, which means not all
>applications will add a BOM for the UTF-8 text.

Yes indeed, for many reasons, adding a BOM to UTF-8 texts is discouraged. Detecting UTF-8 is easy enough without a BOM.

>Finally, I don't think we
>can come up with an auto-detect algorithm for detecting
>Latin-1/UTF-*/Shift-JIS.

For all these, it's not too difficult. Shift-JIS uses bytes in the 0x80-0x9F range, and has specific patterns. If there are only very few characters outside us-ascii, it may not work, but with more non-us-ascii characters, the probability of success is going up very quickly.

Regards, Martin.
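The byte-level clues traded in this exchange can be written down as three small checks. This is a sketch with hypothetical function names. The Shift_JIS ranges used are the commonly documented ones (lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC, single-byte half-width katakana 0xA1-0xDF), which is an assumption that differs slightly from the 0x40-0xF0 trail range quoted in the message above.

```python
def has_zero_byte(data):
    """Martin's hint: a zero byte in real text strongly suggests UTF-16 (or UTF-32)."""
    return 0 in data


def has_c1_bytes(data):
    """True if any byte is in 0x80-0x9F, which ISO-8859-1 text does not use."""
    return any(0x80 <= b <= 0x9F for b in data)


def is_structurally_shift_jis(data):
    """Check that the bytes follow the Shift_JIS lead/trail byte pattern."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            i += 1  # ASCII-range byte or single-byte half-width katakana
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            if i + 1 >= len(data):
                return False  # lead byte with nothing after it
            trail = data[i + 1]
            if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
                return False
            i += 2
        else:
            return False  # 0x80, 0xA0, 0xFD-0xFF are not used
    return True
```

Vinod's objection shows up directly: text made only of half-width katakana passes the Shift_JIS structure check yet contains no byte in 0x80-0x9F, so these checks cannot separate it from Latin-1 and a frequency-based method has to decide. The zero-byte test is likewise only a hint, since UTF-16 text in Japanese may contain no zero bytes at all.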