shlomoy has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to process texts (actually, HTML pages) that contain English, Hebrew, or both. The Hebrew text is written in CP1255, ISO-8859-8, or UTF-8. My question: how can I detect which encoding is used in the texts that I process?
--- Shlomo Yona
http://cs.haifa.ac.il/~shlomo/

Re: Encoding Detection
by gaal (Parson) on Jan 10, 2005 at 21:31 UTC
    You may not be able to tell for a particular text whether it is in Windows-1255 or in ISO-8859-8, but the good news is that in those cases there isn't any difference in interpretation.

    Likewise, English text may be either ASCII or UTF-8; when it is pure ASCII there is no interpretation difference.

    I'd suggest the following, which is not optimal but wins for simplicity:

    1. If the text validates as UTF-8, it is UTF-8. This is a relatively inexpensive pass on the data.
    2. (From here on, work in octet mode.) If the text contains the octet 0xDF, interpret it as a DOUBLE LOW LINE (U+2017): the text is in ISO-8859-8, since this code point is not defined in Windows-1255.
    3. Actually, from here you can assume Windows-1255, which is a superset of ISO-8859-8 apart from the undefined character in the previous item.

    This suggests an efficient algorithm: interleave a UTF-8 validator with a 0xDF detector. If you can assume your input is what you say it is, you have a fast one-pass function.
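
    A minimal Perl sketch of that heuristic (the function name and the returned labels are just illustrative, and for brevity it makes two passes over the data rather than the interleaved single pass described above):

        use strict;
        use warnings;
        use Encode qw(decode FB_CROAK);

        # Guess the encoding of a string of raw octets:
        #   valid UTF-8                               => 'UTF-8'
        #   contains 0xDF (undefined in Windows-1255) => 'ISO-8859-8'
        #   anything else                             => 'cp1255'
        sub guess_hebrew_encoding {
            my ($octets) = @_;

            # Strict validation: decode() croaks on malformed UTF-8.
            # Decode a copy so the original octets are left alone.
            my $copy = $octets;
            my $is_utf8 = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
            return 'UTF-8' if $is_utf8;

            # 0xDF is DOUBLE LOW LINE in ISO-8859-8, undefined in cp1255.
            return 'ISO-8859-8' if $octets =~ /\xDF/;

            return 'cp1255';
        }

        # For example, to decode a page once the encoding is guessed:
        #   my $text = decode(guess_hebrew_encoding($raw), $raw);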

    References:

    Unicode's official notion of ISO-8859-8:
    ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-8.TXT

    Windows-1255 reference (Unicode):
    ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1255.TXT

    Windows-1255 reference (Microsoft):
    http://www.microsoft.com/typography/unicode/1255.htm

    (Credit for some of the research behind this goes to Anatoly Vorobey.)

Re: Encoding Detection
by zentara (Cardinal) on Jan 10, 2005 at 21:03 UTC
    UPDATE (Mar 21, 2006): fixed the broken link.

    It isn't Perl, but this was just announced on freshmeat today, and I'll pass it on in the interest of helping you: programs for examining Unicode files. They work pretty nicely.


    I'm not really a human, but I play one on earth. flash japh
      Now that a year has gone by since zentara's post, a more current URL for that package by Bill Poser is:

      http://www.billposer.org/Software/unidesc.html

      (The relevance of that tool set to the OP's particular question is doubtful, since it does not address non-Unicode encodings at all, but the tools are bound to be interesting to anyone interested in Unicode in general.)