Ideas for Charset detection

Goal: take a block of bytes (from 2 bytes up to x MB) and try to determine, with some degree of certainty, its text encoding and main language. Assume that the input really is text in a written language: not random text, not "gibberish", and not a binary format.

The general idea is to combine multiple methods, each designed to work well on part of the problem; only the combination is expected to work well on "all" texts. In principle, go from simple, fast methods to more complicated ones as necessary. Some algorithms can run in parallel, with the input bytes fed to multiple "engines" at once, which is what Shanjian Li calls Parallel State Machines.
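
As a rough illustration of feeding the same bytes to several "engines" at once, here is a minimal sketch. The class and function names are made up for this note and are not taken from the Netscape code. The UTF-8 engine is deliberately simplified (it does not reject overlong forms or surrogates), and the Shift_JIS byte ranges follow the usual layout: lead bytes 0x81-0x9F and 0xE0-0xFC, trail bytes 0x40-0x7E and 0x80-0xFC, single-byte katakana 0xA1-0xDF.

```python
from typing import List


class AsciiEngine:
    name = "US-ASCII"

    def feed(self, b: int) -> bool:
        # Any byte with the high bit set rules out US-ASCII.
        return b < 0x80

    def finished(self) -> bool:
        return True


class Utf8Engine:
    name = "UTF-8"

    def __init__(self) -> None:
        self.pending = 0  # continuation bytes still expected

    def feed(self, b: int) -> bool:
        if self.pending:
            if 0x80 <= b <= 0xBF:
                self.pending -= 1
                return True
            return False
        if b < 0x80:
            return True
        if 0xC2 <= b <= 0xDF:
            self.pending = 1
        elif 0xE0 <= b <= 0xEF:
            self.pending = 2
        elif 0xF0 <= b <= 0xF4:
            self.pending = 3
        else:
            return False  # stray continuation byte or illegal lead byte
        return True

    def finished(self) -> bool:
        return self.pending == 0


class ShiftJisEngine:
    name = "Shift_JIS"

    def __init__(self) -> None:
        self.need_trail = False

    def feed(self, b: int) -> bool:
        if self.need_trail:
            self.need_trail = False
            return 0x40 <= b <= 0xFC and b != 0x7F
        if b < 0x80 or 0xA1 <= b <= 0xDF:
            return True  # single byte: ASCII range or halfwidth katakana
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            self.need_trail = True
            return True
        return False

    def finished(self) -> bool:
        return not self.need_trail


def surviving_encodings(data: bytes) -> List[str]:
    """Feed every byte to all engines; drop an engine on its first illegal byte."""
    alive = [AsciiEngine(), Utf8Engine(), ShiftJisEngine()]
    for b in data:
        alive = [engine for engine in alive if engine.feed(b)]
        if not alive:
            break
    return [engine.name for engine in alive if engine.finished()]
```

With this sketch, `surviving_encodings("café".encode("utf-8"))` leaves both UTF-8 and Shift_JIS standing, because the two UTF-8 bytes of "é" also happen to be legal single-byte katakana. That is the reason legality checks alone are not enough and have to be combined with frequency-based methods.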

Existing methods appear to be limited to the use cases of their applications. For example, the Netscape 6.1 charset detection concentrates on emails and web pages and seems to ignore UTF-32, EBCDIC, and SCSU.

Separate Methods

References

Netscape 6.1 implements charset detection with a mixture of Parallel State Machines for encoding detection, UTF-16 BOM testing, and byte/di-byte frequency analysis. See the presentation at IUC 19, B6, A Composite Approach to Language/Encoding Detection. See also Mozilla bug number 33337.
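
BOM testing is the cheapest of these methods and can run before anything else. The sketch below is only an illustration, not the Netscape implementation; `sniff_bom` is a hypothetical name, and the sketch also checks UTF-8 and UTF-32 signatures in addition to the UTF-16 BOM mentioned above. It uses the BOM constants from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM says nothing about the encoding, so a `None` result only means that the other methods still have to run. Note also that FF FE 00 00 could in principle be UTF-16LE text starting with U+0000; the sketch assumes that real text does not start that way.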


Discussion about charset detection on the www-international@w3.org mailing list, snapshot from 2001-sep-06:

  ----- Original Message -----
  From: "Thierry Sourbier" <webmaster@i18ngurus.com>
  To: "W3intl (E-mail)" <www-international@w3.org>
  Sent: Thursday, September 06, 2001 21:17
  Subject: Re: auto-detecting the character encoding of an uploaded file

  > > Yes. Shift_JIS can have bytes in the 0x80-0x9F range, but Latin-1 doesn't.
  > > If there is such a byte, Shift_JIS cannot be misinterpreted as Latin-1.
  > > It may be misinterpreted as windows-1252, but that's a different story.
  >
  > The value of one byte may indeed not be enough to differentiate between
  > various encodings, but for most european languages it is fairly rare to
  > have consecutive extended characters (by extended I mean with a code value
  > above 127). Therefore a Shift-JIS encoded Japanese text and a European
  > windows-1252 one are fairly easy to differentiate when you look at the
  > entire stream.
  >
  > Lenny, you might want to have a look at TextCat. This little tool (which
  > source code is available) helps you recognize 69 languages and encoding
  > combinations. It probably can easily be extended to more. Byte pattern
  > analysis is probably the most generic way to go, it gives great result even
  > on fairly small text size.
  >
  > Text cat can be found at:
  > http://odur.let.rug.nl/~vannoord/TextCat/
  >
  > More links on languages indentification tools & techniques can be found at:
  > http://www.i18ngurus.com/docs/998504805.html
  >
  > Cheers,
  > Thierry
  >
  >
  <><><><><><><><><><><><><><><><><><><><><><>
  > www.i18ngurus.com - Open Internationalization Resources Directory
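
The observation in the message above, that consecutive "extended" bytes are rare in European single-byte text but common in Shift_JIS, can be turned into a simple statistic. The following is a sketch under that assumption only; the function name is invented for this note, and no decision threshold is suggested because the thread does not give one.

```python
def paired_high_byte_ratio(data: bytes) -> float:
    """Fraction of bytes >= 0x80 that sit directly next to another such byte."""
    high = [b >= 0x80 for b in data]
    total = sum(high)
    if total == 0:
        return 0.0  # pure 7-bit data: the statistic carries no information
    paired = 0
    for i, is_high in enumerate(high):
        if not is_high:
            continue
        prev_high = i > 0 and high[i - 1]
        next_high = i + 1 < len(high) and high[i + 1]
        if prev_high or next_high:
            paired += 1
    return paired / total
```

French text in windows-1252 with isolated accented letters yields a ratio near 0, while Shift_JIS hiragana, where both bytes of each character are above 127, yields a ratio near 1. Text in a non-Latin single-byte encoding (Cyrillic or Greek, for example) would also score high, so this statistic can only be one input among several.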
  

 

 

 

----- Message from Merle Tenney <Merle.Tenney@corp.palm.com> on Thu, 6 Sep 2001 14:20:51 -0700 -----

To: "'Bob Jung'" <bobj@netscape.com>, Martin Duerst <duerst@w3.org>
cc: vinod@filemaker.com, Lenny Turetsky <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>, Shanjian Li <shanjian@netscape.com>, momoi@netscape.com
Subject: RE: auto-detecting the character encoding of an uploaded file

 

  Thanks, Bob, for the reference to your team's upcoming paper (and hi, by the
  way).

  Most of this discussion has focused on subtle differences in legal
  codepoints in various encodings and legal patterns of bytes in encodings.
  There is another approach, though, which is much more effective and gives
  you valuable additional information to boot. That is a system based on
  relative byte and n-gram frequencies, which have characteristic patterns for
  a given pair of language and encoding. So in English, for example, "e",
  "th", "tion", and " the " are quite common, whereas in Spanish "os ",
  "ción", and " un " are quite common. In different encodings, say Windows
  1252, MacRoman, UTF-8, UTF-16BE, and UTF-16LE, this will translate into
  different relative n-gram frequencies.

  The way the system works is that first a development version is exposed to a
  reasonable corpus of a particular language in a particular encoding. For
  some reason, 100K words seems to stand out in my mind. The system then
  empirically calculates the frequencies for this corpus, and stores the
  salient frequencies in a table. And the amazing thing is that it is fast
  and accurate, and it requires zero intervention by encoding specialists or
  linguists. This table is packaged up with the tables for the other
  languages and encodings previously profiled and these tables are included
  with the auto-detection software, which is normally bundled with a browser,
  search engine, word processor, etc. In actual use, a passage of user text
  has *its* n-gram frequencies calculated on the fly, and these are compared
  to the stored profiles, and a match is made for the profile that most
  closely matches the sample. In practice, it is surprisingly good and works
  on a surprisingly short text sample. You can usually determine the language
  and encoding of the sample with near certainty in a line or two of text.

  To the best of my knowledge, this approach was first proposed and developed
  by Ken Beesley in the late 80s while he was at ALP Systems. Here is a
  reference to that early work:
  http://www.xrce.xerox.com/people/beesley/langid.html

  Ken subsequently joined Xerox PARC and then XRCE in Grenoble, where the work
  was picked up. It was later commercialized by Xerox's spin-off InXight as
  part of their LinguistX Platform product:
  http://www.inxight.com/products_sp/linguistx/index.html

  I know that Microsoft has developed a similar technology, which is shown off
  quite well in their multilingual spelling checking in Word. However, I
  don't think it is available to developers outside of Microsoft. Inso also
  had a competitive technology, which they sold to Lernout & Hauspie. It is
  now called the IntelliScope Language Recognizer, and it is part of
  their IntelliScope Retrieval Toolkit, described here:
  http://www.lhsl.com/tech/icm/retrieval/toolkit/lr.asp

  I can't tell from your brief description, Bob, if n-gram frequencies (under
  a different name) are part of your Mozilla work or not. If they're not,
  they should be. :-)

  The bottom line, folks, is that there are a lot better technologies
  available which allow you to automatically detect encodings, and they come
  with the tremendous additional benefit of being able to identify the
  language as well. We can all imagine lots of ways we could use that
  information. Maybe some of you will start sniffing down a different trail
  for a solution here....

  Merle
  > -----Original Message-----
  > From: bobj@netscape.com [mailto:bobj@netscape.com]
  > Sent: Thursday, September 06, 2001 8:11 PM
  > To: Martin Duerst
  > Cc: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail); Shanjian Li;
  > momoi@netscape.com
  > Subject: Re: auto-detecting the character encoding of an uploaded file
  >
  >
  > FYI, there will be a paper presented at the Nineteenth International
  > Unicode Conference (IUC19), to be held on September 10 - 14, 2001 in
  > San Jose, California:
  >
  > A Composite Approach to Language/Encoding Detection
  > by Shanjian Li & Katsuhiko Momoi - Netscape Communications
  >
  > Session B6: Wednesday, September 12, 2:50 pm - 3:30 pm
  >
  > Abstract: http://www.unicode.org/iuc/iuc19/a322.html
  >
  > And since this is part of Mozilla, it is all open source!
  >
  > http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/
  >
  > -Bob
  >
  > Martin Duerst wrote:
  >
  > > At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
  > >
  > >> Lenny ,
  > >>
  > >> Just some thoughts.
  > >>
  > >> Since you have mentioned Shift-JIS,
  > >
  > >
  > > As a charset, spelled shift_jis (case doesn't matter, but the
  > > underscore does).
  > >
  > >
  > >> there is no guarantee that every other
  > >> byte in UTF-16 is zero especially for non-us systems like
  > >> Japanese/European.
  > >
  > >
  > > No. But if you see even a single zero byte, then the chance that the
  > > document is in UTF-16 is very high.
  > >
  > >
  > >> Also there is no significance for BOM for UTF-8, which means not all
  > >> applications will add a BOM for the UTF-8 text.
  > >
  > >
  > > Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
  > > discouraged. Detecting UTF-8 is easy enough without a BOM.
  > >
  > >
  > >> Finally, I don't think we
  > >> can come up with an auto-detect algorithm for detecting
  > >> Latin-1/UTF-*/Shift-JIS.
  > >
  > >
  > > For all these, it's not too difficult. Shift-JIS uses bytes in
  > > the 0x80-0x9F range, and has specific patterns. If there are
  > > only very few characters outside us-ascii, it may not work,
  > > but with more non-us-ascii characters, the probability
  > > of success is going up very quickly.
  > >
  > >
  > > Regards, Martin.
  > >
  >
  >
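
The profile-and-compare scheme Merle Tenney describes above can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the two toy one-sentence "corpora" (real profiles would be trained on something like the 100K words he mentions), the choice of byte n-grams up to length 3, and the use of cosine similarity as the comparison. The message does not say how the commercial products compare profiles.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def ngram_profile(data: bytes, max_n: int = 3) -> Counter:
    """Count every byte n-gram of length 1..max_n in data."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity is scale-invariant, so raw counts behave like
    # relative frequencies here.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(sample: bytes, profiles: Dict[Tuple[str, str], Counter]) -> Tuple[str, str]:
    """Return the (language, encoding) label whose profile is closest to the sample."""
    sample_profile = ngram_profile(sample)
    return max(profiles, key=lambda label: cosine_similarity(sample_profile, profiles[label]))


# Toy training text, invented for this sketch.
ENGLISH = ("the quick brown fox jumps over the lazy dog and then "
           "the nation takes action on the motion")
SPANISH = ("los niños comen en la canción de los años y un señor "
           "canta una canción en español")

PROFILES = {
    ("en", "windows-1252"): ngram_profile(ENGLISH.encode("cp1252")),
    ("es", "windows-1252"): ngram_profile(SPANISH.encode("cp1252")),
    ("es", "UTF-8"): ngram_profile(SPANISH.encode("utf-8")),
}
```

Calling `best_match("la canción de un niño".encode("utf-8"), PROFILES)` picks the Spanish UTF-8 profile, since only that profile contains the byte sequences that UTF-8 uses for "ó" and "ñ". For text that is pure ASCII, profiles of the same language in different ASCII-compatible encodings are indistinguishable, which is why the toy table holds only one English profile.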
  

 

----- Message from vinod_balakrishnan@filemaker.com (Vinod Balakrishnan) on Thu, 6 Sep 2001 10:40:14 -0700 -----

To: "Martin Duerst" <duerst@w3.org>, "Lenny Turetsky" <LTuretsky@salesforce.com>, "W3intl (E-mail)" <www-international@w3.org>
Subject: RE: auto-detecting the character encoding of an uploaded file

  >For all these, it's not too difficult. Shift-JIS uses bytes in
  >the 0x80-0x9F range, and has specific patterns. If there are
  >only very few characters outside us-ascii, it may not work,
  >but with more non-us-ascii characters, the probability
  >of success is going up very quickly.
  Shift-JIS represent the trailing bytes of double byte in 0x40-0xF0 range
  (only the leading byte is in high ASCII range). Also the Hankaku (single
  byte) Kana is represented as single byte in the high ASCII range. A Japanese
  text in Shift-JIS contains single byte kana and Kanji characters can be
  misinterpreted as Latin-1
  -Vinod
  -----Original Message-----
  From: Martin Duerst [mailto:duerst@w3.org]
  Sent: Wednesday, September 05, 2001 6:36 PM
  To: vinod@filemaker.com; Lenny Turetsky; W3intl (E-mail)
  Subject: RE: auto-detecting the character encoding of an uploaded file
   
  At 13:51 01/09/05 -0700, Vinod Balakrishnan wrote:
  >Lenny ,
  >
  >Just some thoughts.
  >
  >Since you have mentioned Shift-JIS,
  As a charset, spelled shift_jis (case doesn't matter, but the underscore
  does).
   
  >there is no guarantee that every other
  >byte in UTF-16 is zero especially for non-us systems like
  >Japanese/European.
  No. But if you see even a single zero byte, then the chance that the
  document is in UTF-16 is very high.
   
  >Also there is no significance for BOM for UTF-8, which means not all
  >applications will add a BOM for the UTF-8 text.
  Yes indeed, for many reasons, adding a BOM to UTF-8 texts is
  discouraged. Detecting UTF-8 is easy enough without a BOM.
   
  >Finally, I don't think we
  >can come up with an auto-detect algorithm for detecting
  >Latin-1/UTF-*/Shift-JIS.
  For all these, it's not too difficult. Shift-JIS uses bytes in
  the 0x80-0x9F range, and has specific patterns. If there are
  only very few characters outside us-ascii, it may not work,
  but with more non-us-ascii characters, the probability
  of success is going up very quickly.

  Regards, Martin.
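
Two of Martin Duerst's points in this thread lend themselves to very small checks: UTF-8 can be recognized without a BOM by testing whether the bytes decode cleanly, and zero bytes are a strong hint for UTF-16. The sketch below is one possible reading of those remarks, with hypothetical function names. Using the position of the zero bytes to guess the byte order is an addition of this note and only works when the text contains characters below U+0100.

```python
from typing import Optional


def looks_like_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_utf16_by_zero_bytes(data: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from where the zero bytes fall, if any."""
    zeros_at_even = data[0::2].count(0)
    zeros_at_odd = data[1::2].count(0)
    if zeros_at_even == zeros_at_odd:
        return None  # no zero bytes, or no clear pattern
    # Big-endian puts the (often zero) high byte of each unit first.
    return "UTF-16BE" if zeros_at_even > zeros_at_odd else "UTF-16LE"
```

Both checks have blind spots. A buffer cut off in the middle of a multi-byte sequence fails the UTF-8 test even though the text is UTF-8, and UTF-32 data also contains zero bytes, so it should be ruled out before the UTF-16 guess is trusted.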