Re: How to Identify a language

It depends on what kind of data you have. If it's all in Unicode, you can narrow down the possible languages for a given piece of text simply by looking at which code-point ranges the text uses. The ranges are assigned per script rather than per language, but in many cases the script alone rules out most candidates. Chapter 2 ("General Structure") of the Unicode Standard (find the PDF files here: http://www.unicode.org/versions/Unicode4.1.0/) gives a nice overview of how the code-point ranges are allocated; you can get more details about the characters in each range here: http://www.unicode.org/charts/.
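As a rough sketch of the "look at the ranges" idea, the snippet below tallies how many characters of a string fall into a handful of Unicode scripts. The sub name `script_counts` and the choice of scripts are only my assumptions for illustration; extend the list to whatever scripts you expect in your data.

```perl
use strict;
use warnings;

# Count how many characters of $text belong to each script of interest.
# The script list is illustrative, not exhaustive.
sub script_counts {
    my ($text) = @_;
    my %tests = (
        Han      => qr/\p{Han}/,
        Hiragana => qr/\p{Hiragana}/,
        Katakana => qr/\p{Katakana}/,
        Hangul   => qr/\p{Hangul}/,
        Cyrillic => qr/\p{Cyrillic}/,
        Latin    => qr/\p{Latin}/,
    );
    my %count;
    for my $char (split //, $text) {
        for my $script (keys %tests) {
            $count{$script}++ if $char =~ $tests{$script};
        }
    }
    return \%count;
}

# Three Han characters, one hiragana, four katakana (written as escapes
# so the source file itself stays plain ASCII).
my $counts = script_counts("\x{65E5}\x{672C}\x{8A9E}\x{306E}\x{30C6}\x{30AD}\x{30B9}\x{30C8}");
printf "%-8s %d\n", $_, $counts->{$_} for sort keys %$counts;
```

A tally like this only tells you which scripts are present; deciding what mix of scripts implies which language is still up to you.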

Even among the Asian languages with large character inventories, each language tends to use some characters that the other Asian languages do not use.

To the extent that the same characters are used in two or more languages, the frequency ranking of the most commonly used characters in each language will tend to be distinctive. But you need a good sample of known text in each language (at least 50,000 characters for any language that uses CJK characters) in order to get good-enough statistics, and even then, the reliability of identification will depend on the size of the text you are trying to identify.
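Here is a minimal sketch of the frequency-ranking comparison. The post above does not say how to compare two rankings, so I have picked a simple "out-of-place" rank distance; that choice, and the sub names `char_ranking`, `rank_distance` and `closest_language`, are my own assumptions. The training strings are toy data, far smaller than the sample sizes recommended above.

```perl
use strict;
use warnings;

# Return the $top most frequent non-whitespace characters, most frequent first.
sub char_ranking {
    my ($text, $top) = @_;
    my %freq;
    $freq{$_}++ for grep { /\S/ } split //, $text;
    my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    $#ranked = $top - 1 if @ranked > $top;
    return @ranked;
}

# Sum of rank differences between a sample ranking and a language profile.
# Characters missing from the profile get the maximum penalty.
sub rank_distance {
    my ($profile, $sample) = @_;    # array refs of ranked characters
    my %rank_in_profile;
    @rank_in_profile{@$profile} = 0 .. $#$profile;
    my $penalty = @$profile;
    my $dist    = 0;
    for my $i (0 .. $#$sample) {
        my $r = $rank_in_profile{ $sample->[$i] };
        $dist += defined $r ? abs($r - $i) : $penalty;
    }
    return $dist;
}

# Pick the profile whose ranking is closest to that of $text.
sub closest_language {
    my ($profiles, $text, $top) = @_;
    my @sample = char_ranking($text, $top);
    my ($best, $best_dist);
    for my $lang (sort keys %$profiles) {
        my $d = rank_distance($profiles->{$lang}, \@sample);
        ($best, $best_dist) = ($lang, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return $best;
}

# Toy profiles; real ones would be built from large known-language samples.
my %profiles = (
    lang_a => [ char_ranking("aaaabbbc", 3) ],
    lang_b => [ char_ranking("xxxxyyyz", 3) ],
);
print closest_language(\%profiles, "aabbc", 3), "\n";
```

In practice you would build each profile once from the known text, store it, and use a much larger `$top` than 3.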

(update: the frequency ranking of character bi-grams will be even more distinctive; you will tend to need more training data to get good statistics, but you can get more reliable results when trying to identify smaller amounts of unknown data.)
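Collecting bigrams instead of single characters is a small change. The sketch below counts overlapping two-character sequences and returns the most frequent ones; `bigram_counts` and `top_bigrams` are hypothetical names, and skipping bigrams that contain whitespace is just one possible design choice.

```perl
use strict;
use warnings;

# Count overlapping two-character sequences, skipping any that contain whitespace.
sub bigram_counts {
    my ($text) = @_;
    my %count;
    for my $i (0 .. length($text) - 2) {
        my $bigram = substr($text, $i, 2);
        next if $bigram =~ /\s/;
        $count{$bigram}++;
    }
    return \%count;
}

# Return up to $top bigrams, most frequent first (ties broken alphabetically).
sub top_bigrams {
    my ($text, $top) = @_;
    my $count  = bigram_counts($text);
    my @ranked = sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
    return @ranked > $top ? @ranked[0 .. $top - 1] : @ranked;
}

print join(' ', top_bigrams("abab abab", 2)), "\n";
```

The resulting ranked list can be compared against per-language bigram profiles in the same way as a single-character ranking.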

If you have data in "legacy" (non-Unicode) encodings such as KSC, GB, Big5, Shift-JIS, etc., the encoding tends to correlate with the language. For that, you can try Encode::Guess, which is actually best suited for distinguishing among the various Asian legacy encodings.
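A minimal sketch of calling Encode::Guess follows. `guess_encoding` returns an encoding object on success and a plain error string when it fails or cannot decide between candidates, so the wrapper checks `ref`. The wrapper name `guess_legacy_encoding` is hypothetical, and the suspect list here is only an example; pass whichever encodings you actually expect (note that an unknown encoding name makes `guess_encoding` die).

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Return the name of the single encoding that fits $bytes, or undef when
# the guess fails or is ambiguous. @suspects are Encode encoding names.
sub guess_legacy_encoding {
    my ($bytes, @suspects) = @_;
    my $result = guess_encoding($bytes, @suspects);
    return ref($result) ? $result->name : undef;    # non-ref means error string
}

# Build some Shift-JIS bytes to test with (five hiragana characters).
my $bytes = encode('shiftjis', "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}");
my $name  = guess_legacy_encoding($bytes, qw(shiftjis euc-jp));
print defined $name ? "looks like $name\n" : "could not decide\n";
```

With short inputs or a long suspect list, expect ambiguous results fairly often; the more bytes you feed it, the better the chance that only one encoding decodes cleanly.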


Re^2: How to Identify a language by Gavin (Archbishop) on Sep 18, 2006 at 18:33 UTC
Further to what Graff has already pointed out regarding stop words and special characters for language recognition: a Dr Benedetto found that the language of a text could be determined by comparing the entropy of two texts. I do not know if there is a Perl module that would fit the bill, but perhaps someone here knows more about zip programs. Gavin
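For what it's worth, here is a rough sketch of how the zipping idea could be tried in Perl, as I understand it: append the unknown text to a reference text in each candidate language, and see which reference lets the compressor encode the addition most cheaply. It uses `compress` from Compress::Zlib; the sub names `zipped_size` and `closest_by_zipping` and the toy reference texts are my own assumptions, not anything taken from the original work.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress);
use Encode qw(encode_utf8);

# Compressed size in bytes of a character string (encoded as UTF-8 first).
sub zipped_size {
    my ($text) = @_;
    return length( compress( encode_utf8($text) ) );
}

# For each reference text, measure how many extra compressed bytes the
# unknown text costs when appended; the smallest increase is the best match.
sub closest_by_zipping {
    my ($references, $unknown) = @_;    # hash ref: label => reference text
    my ($best, $best_cost);
    for my $label (sort keys %$references) {
        my $ref_text = $references->{$label};
        my $cost = zipped_size($ref_text . $unknown) - zipped_size($ref_text);
        ($best, $best_cost) = ($label, $cost)
            if !defined $best_cost || $cost < $best_cost;
    }
    return $best;
}

# Toy reference texts; real ones should be much longer and more varied.
my %references = (
    en => "the quick brown fox jumps over the lazy dog " x 20,
    de => "der schnelle braune fuchs springt ueber den faulen hund " x 20,
);
print closest_by_zipping(\%references, "the lazy dog jumps over the quick brown fox"), "\n";
```

Keep in mind that zlib only looks back over a 32 KB window, so only the tail end of a long reference text actually helps.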
Re^2: How to Identify a language by Anonymous Monk on Sep 19, 2006 at 16:08 UTC
If you want to use code-point ranges, Unicode property classes (see perlunicode) are pretty handy.

```perl
sub identify_CJK {
    local $_ = shift;
    return "J" if /\p{Hiragana}|\p{Katakana}/;
    return "K" if /\p{Hangul}/;
    return "C" if /\p{Han}/;
    return "Others";
    # Note that the order matters because Japanese text
    # most likely contains Hanzi (Kanji) characters and
    # so does Korean text (less frequently though).
}
```

I think it works in most cases as long as all the texts you want to test are converted to Perl's internal representation of strings (with utf8 flag on).