shlomoy has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to process texts (actually, HTML pages) that contain English, Hebrew, or both. The Hebrew text is written in CP1255, ISO-8859-8, or UTF-8. My question: how can I detect which encoding is used in the texts that I process?
--- Shlomo Yona
http://cs.haifa.ac.il/~shlomo/

Re: Encoding Detection
by gaal (Parson) on Jan 10, 2005 at 21:31 UTC
    You may not be able to tell for a particular text whether it is in Windows-1255 or in ISO-8859-8, but the good news is that in those cases there isn't any difference in interpretation.

    Likewise, English text may be either ASCII or UTF-8; when it is pure ASCII there is no interpretation difference.

    I'd suggest the following, which is not optimal but wins for simplicity:

    1. If the text validates as UTF-8, it is UTF-8. This is a relatively inexpensive pass on the data.
    2. (From here on, work in octet mode.) If the text contains the octet 0xDF, interpret it as a DOUBLE LOW LINE (U+2017): the text is in ISO-8859-8, since this code point is not defined in Windows-1255.
    3. Actually, from here you can assume Windows-1255, which is a superset of ISO-8859-8 apart from the undefined character in the previous item.

    This suggests an efficient algorithm: interleave a UTF-8 validator with a 0xDF detector. If you can assume your input is what you say it is, you have a fast one-pass function.
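
    A minimal Perl sketch of that heuristic (the function name and the returned labels are just illustrative, and for brevity it makes two passes over the data rather than the interleaved single pass described above):

        use strict;
        use warnings;
        use Encode qw(decode FB_CROAK);

        # Guess the encoding of a string of raw octets:
        #   valid UTF-8                               => 'UTF-8'
        #   contains 0xDF (undefined in Windows-1255) => 'ISO-8859-8'
        #   anything else                             => 'cp1255'
        sub guess_hebrew_encoding {
            my ($octets) = @_;

            # Strict validation: decode() croaks on malformed UTF-8.
            # Decode a copy so the original octets are left alone.
            my $copy = $octets;
            my $is_utf8 = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
            return 'UTF-8' if $is_utf8;

            # 0xDF is DOUBLE LOW LINE in ISO-8859-8, undefined in cp1255.
            return 'ISO-8859-8' if $octets =~ /\xDF/;

            return 'cp1255';
        }

        # For example, to decode a page once the encoding is guessed:
        #   my $text = decode(guess_hebrew_encoding($raw), $raw);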

    References:

    Unicode's official notion of ISO-8859-8:
    ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-8.TXT

    Windows-1255 reference (Unicode):
    ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1255.TXT

    Windows-1255 reference (Microsoft):
    http://www.microsoft.com/typography/unicode/1255.htm

    (Credit for some of the research behind this goes to Anatoly Vorobey.)

Re: Encoding Detection
by zentara (Cardinal) on Jan 10, 2005 at 21:03 UTC
    UPDATE (Mar 21, 2006): fixed the broken link.

    It isn't Perl, but this was just announced on freshmeat today, and I'll pass it on in the interest of helping you: programs for examining Unicode files. They work pretty nicely.


    I'm not really a human, but I play one on earth. flash japh
      Now that a year has gone by since zentara's post, a more current URL for that package by Bill Poser is:

      http://www.billposer.org/Software/unidesc.html

      (The relevance of that tool set to the OP's particular question is doubtful, since it does not address non-Unicode encodings at all, but the tools are bound to be interesting to anyone interested in Unicode in general.)