It depends on what kind of data you have. If it's all in Unicode, you can narrow down the possible languages for a given piece of text simply by looking at the ranges of code points it uses. Chapter 2 ("General Structure") of the Unicode Standard (the PDF is here: http://www.unicode.org/versions/Unicode4.1.0/) gives a nice overview of language-specific code-point ranges, and you can get more details about character mappings on a per-language basis here: http://www.unicode.org/charts/.
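As a rough illustration of that idea (a sketch only; the block boundaries come from the Unicode code charts, the helper name is just for this example, and only a handful of blocks are shown):

# Report which of a few script-specific code-point ranges occur in an
# already-decoded string.
my %ranges = (
    Hiragana => [ 0x3040, 0x309F ],
    Katakana => [ 0x30A0, 0x30FF ],
    Hangul   => [ 0xAC00, 0xD7AF ],   # Hangul Syllables block
    Han      => [ 0x4E00, 0x9FFF ],   # CJK Unified Ideographs block
);

sub scripts_present {
    my ($text) = @_;
    my %seen;
    for my $cp ( map { ord } split //, $text ) {
        for my $name ( keys %ranges ) {
            $seen{$name}++
                if $cp >= $ranges{$name}[0] && $cp <= $ranges{$name}[1];
        }
    }
    return \%seen;    # e.g. { Hiragana => 12, Han => 340 }
}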
Even among the Asian languages with large character inventories, each language tends to use characters that the others do not.
To the extent that the same characters are used in two or more languages, the frequency ranking of the most commonly used characters in each language will still tend to be distinctive. But you need a good sample of known text in each language (at least 50,000 characters for any language that uses CJK characters) to get good-enough statistics, and even then, the reliability of identification will depend on the size of the text you are trying to identify.
(Update: the frequency ranking of character bi-grams is even more distinctive; you will need more training data to get good statistics, but you can get more reliable results when identifying smaller amounts of unknown data.)
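To make the bi-gram idea concrete, here is one way it might be sketched; the %profiles structure (per-language bi-gram counts built from your training text) and the smoothing constant are assumptions of this sketch, not a prescription:

# Sketch only: %profiles maps a language tag (e.g. 'ja', 'ko', 'zh') to a
# hash of character-bigram counts built from known training text.
# Both the training text and $unknown are assumed to be decoded strings.
sub guess_by_bigrams {
    my ($unknown, %profiles) = @_;

    # Count bigrams in the unknown text.
    my %bigrams;
    my @chars = split //, $unknown;
    $bigrams{ $chars[$_] . $chars[$_ + 1] }++ for 0 .. $#chars - 1;

    # Score each language with a naive log-likelihood (crude smoothing).
    my ($best_lang, $best_score);
    for my $lang (keys %profiles) {
        my $total = 0;
        $total += $_ for values %{ $profiles{$lang} };
        my $score = 0;
        for my $bg (keys %bigrams) {
            my $p = ( $profiles{$lang}{$bg} || 0.5 ) / $total;
            $score += $bigrams{$bg} * log($p);
        }
        ($best_lang, $best_score) = ($lang, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best_lang;
}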
If you have data with "legacy" (non-unicode) encodings, like KSC, GB, Big5, Shift-JIS, etc, the encoding tends to correlate with the language, and for that, you can try Encode::Guess, which is actually best suited for identifying among the various Asian legacy encodings. | [reply] |
Further to what Graff has already pointed out regarding stop words and special characters for language recognition:
A Dr. Benedetto found that by comparing the entropy of two texts (as approximated with a compression program) the language could be determined.
I do not know if there is a Perl module that would fit the bill, but perhaps someone here has more knowledge of zip programs.
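As a rough illustration of the idea (a sketch only, using Compress::Zlib; the %reference samples are placeholders you would fill with decently sized known text in each language):

use Compress::Zlib;

# "Zipping" approach: the language whose reference sample compresses the
# unknown text most cheaply (smallest size increase when the two are
# concatenated) is the best guess.
sub guess_by_compression {
    my ($unknown, %reference) = @_;
    my ($best_lang, $best_delta);
    for my $lang (keys %reference) {
        my $base  = length compress($reference{$lang});
        my $both  = length compress($reference{$lang} . $unknown);
        my $delta = $both - $base;   # extra bytes needed to encode $unknown
        ($best_lang, $best_delta) = ($lang, $delta)
            if !defined $best_delta || $delta < $best_delta;
    }
    return $best_lang;
}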
Gavin
If you want to use code-point ranges, Unicode property classes (see perlunicode) are pretty handy.
sub identify_CJK {
    local $_ = shift;

    # Note that the order matters because Japanese text
    # most likely contains Hanzi (Kanji) characters, and
    # so does Korean text (less frequently, though).
    return "J" if /\p{Hiragana}|\p{Katakana}/;
    return "K" if /\p{Hangul}/;
    return "C" if /\p{Han}/;
    return "Others";
}
I think it works in most cases, as long as all the texts you want to test have been converted to Perl's internal representation of strings (with the utf8 flag on).
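For example, if the text lives in a UTF-8 file (the filename here is just a placeholder), decoding it on input is enough:

open my $fh, '<:encoding(UTF-8)', 'sample.txt'
    or die "sample.txt: $!";
my $text = do { local $/; <$fh> };   # slurp; already decoded, utf8 flag on

print identify_CJK($text), "\n";     # prints J, K, C, or Others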
Not a general solution, of course, but Japanese is very easy to distinguish from the other CJK languages. Pretty much any Japanese text will contain a significant number of hiragana characters (a syllabary), and no other language uses hiragana (except occasionally to spell Japanese words).
Likewise, modern Korean is almost exclusively written in Hangul these days.
Thinking over my initial reply above, I realized a few things:
- In order to get a really good and proper histogram over Unicode text, you'd want to see not only how many times each character occurs, but also which "charts" (which language/function subgroups of Unicode characters) are represented in the data, and what their respective frequencies are.
- Getting a proper histogram of that sort involves a fair bit of drudgeful coding and looking up all the details about what the various "charts" really are (their names and the characters they contain); just telling someone "you should try doing that" is sort of infelicitous, bordering on rude.
- I've really been wanting to have just such a tool myself for some time now, and it's about time I got around to that.
So here it is: unichist -- count/summarize characters in data. (I've tested most of its functionality, but not all of it...)
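If all you need is the per-block tally rather than the full tool, a minimal sketch along these lines is possible with Unicode::UCD's charblock (this is not unichist itself, and it only covers the counting step):

use strict;
use warnings;
use Unicode::UCD qw(charblock);

# Tally how many characters of an already-decoded string fall into each
# Unicode block.
sub block_histogram {
    my ($text) = @_;
    my %count;
    $count{ charblock(ord $_) // 'No_Block' }++ for split //, $text;
    return \%count;
}

# Example: slurp UTF-8 text from STDIN and print blocks by frequency.
binmode STDIN, ':encoding(UTF-8)';
my $hist = block_histogram( do { local $/; <STDIN> } );
printf "%-40s %d\n", $_, $hist->{$_}
    for sort { $hist->{$b} <=> $hist->{$a} } keys %$hist;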
If you just want a program to classify text, you might also be interested in TextCat.
It's a Perl script that uses "N-Gram-Based Text Categorization" and has worked for me in the past. I did not need to classify Asian languages, but it's supposed to support CJK.
A list of languages and an article discussing the approach can be found on the page as well.
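TextCat does all of this for you, but for the curious, the "out-of-place" ranking measure the approach is based on looks roughly like this (a sketch of the general idea, not TextCat's actual code; TextCat differs in detail, e.g. in the n-gram lengths and profile sizes it uses):

# Build a ranked list of the most frequent 1..3-character n-grams in a
# (decoded) string.
sub ngram_profile {
    my ($text, $top) = @_;
    my %freq;
    for my $n (1 .. 3) {
        $freq{ substr($text, $_, $n) }++ for 0 .. length($text) - $n;
    }
    my @ranked = sort { $freq{$b} <=> $freq{$a} } keys %freq;
    my $last = $#ranked < $top - 1 ? $#ranked : $top - 1;
    return [ @ranked[0 .. $last] ];
}

# "Out-of-place" distance: how far each n-gram of the unknown profile is
# from its rank in a known language profile; smaller means a better match.
sub out_of_place {
    my ($unknown, $known) = @_;
    my %rank;
    $rank{ $known->[$_] } = $_ for 0 .. $#$known;
    my $dist = 0;
    for my $i (0 .. $#$unknown) {
        $dist += exists $rank{ $unknown->[$i] }
               ? abs( $i - $rank{ $unknown->[$i] } )
               : scalar @$known;   # penalty for an n-gram the profile lacks
    }
    return $dist;
}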
Looking for something totally unrelated, I ran across this article: On Search: I18n, which says:
... if you need to do serious text processing in the CJK domain and you’re not already a native-speaker with computer programming experience in the space, you should purchase Ken Lunde’s excellent book on the subject from O’Reilly.
The quote would appear to pertain to this book: CJKV Information Processing, by Ken Lunde.
HTH,