How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use?

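I use trial-and-error. I first try to treat it as utf8; if that doesn't throw an error, I'm done. (Also, utf8 might be the most likely outcome anyway, and plain ASCII passes this test too, since ASCII is valid utf8.) If the text is not utf8, a strict attempt to decode it as utf8 will almost certainly fail, because non-ASCII bytes from other encodings very rarely happen to form well-formed utf8 sequences, and I'll know that it's some other encoding.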
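Here is a minimal sketch of that first step, assuming the whole file fits in memory. The sub names try_utf8 and read_as_utf8 are my own inventions for illustration. As far as I know, the :encoding(UTF-8) layer by itself tends to warn rather than die on malformed input, so this version reads raw bytes and calls decode with FB_CROAK to get a real failure. It also drops a leading BOM, since the question asked about that.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded text if $bytes is well-formed
# UTF-8, or undef if it is not. A leading BOM (U+FEFF) is removed.
sub try_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # decode from modifying $bytes in place.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return unless defined $text;
    $text =~ s/\A\x{FEFF}//;    # strip the BOM if there is one
    return $text;
}

# Hypothetical helper: slurp a file as raw bytes, then try the strict decode.
sub read_as_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    return try_utf8($bytes);
}
```

If read_as_utf8 returns undef, the file is not utf8 and it's time to move on to the other checks.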

In the latter case, I hope I have some idea of what (human) language the text is supposed to contain, because that will guide how I check for other encodings.

For example, if the language is not Chinese, Japanese or Korean (CJK), the writing system will be some alphabet, usually requiring fewer than 128 distinct code points. In this case, a UTF-16 encoding will have a rather lopsided byte histogram, because half the bytes (the ones holding the upper 8 bits of each character) will have a very limited distribution of values: lots of nulls and, depending on the language, lots of, say, 0x06 (if it's Arabic) or 0x04 (if it's Cyrillic), etc. Seeing whether these values occur at even or odd byte offsets will reveal whether the UTF-16 is BE (even offsets) or LE (odd offsets).
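A rough sketch of that histogram check follows. The sub names and the 0.7 threshold are assumptions of mine, not tested values: the idea is just to measure how much of each byte position (even vs. odd offsets) is taken up by its single most common value, and to call the lopsided position the high byte.

```perl
use strict;
use warnings;

# Share of the most frequent byte value among $total bytes.
sub top_share {
    my ($counts, $total) = @_;
    return 0 unless $total;
    my ($max) = sort { $b <=> $a } values %$counts;
    return $max / $total;
}

# Hypothetical helper: returns 'UTF-16BE', 'UTF-16LE', or undef when
# neither byte position looks lopsided enough. The default threshold
# of 0.7 is an arbitrary assumption; tune it on real data.
sub guess_utf16_order {
    my ($bytes, $threshold) = @_;
    $threshold //= 0.7;
    my @counts = ({}, {});    # histograms for even and odd offsets
    my @total  = (0, 0);
    my $offset = 0;
    for my $byte (unpack 'C*', $bytes) {
        my $parity = $offset++ & 1;
        $counts[$parity]{$byte}++;
        $total[$parity]++;
    }
    my $even = top_share($counts[0], $total[0]);
    my $odd  = top_share($counts[1], $total[1]);
    # High bytes at even offsets means big-endian, at odd offsets little-endian.
    return 'UTF-16BE' if $even >= $threshold && $even > $odd;
    return 'UTF-16LE' if $odd  >= $threshold && $odd  > $even;
    return;
}
```

This can be fooled by short or very repetitive single-byte text, so I'd treat the answer as a hint to confirm by decoding and looking at the result.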

If the text is supposed to be CJK (and it isn't utf8), I'll go right to Encode::Guess. Likewise if the text is clearly not a 16-bit encoding (i.e. it's not CJK, not UTF-16, and not utf8), which leaves some single-byte encoding.
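For what it's worth, here is a small sketch of calling Encode::Guess. The wrapper name guess_among is made up; guess_encoding itself is the function the module exports, and it returns an encoding object on success or a plain error string when it can't decide. Without extra suspects it only considers ascii, utf8 and the BOM-marked UTF-16/32 forms, so you have to name the candidates you suspect.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical wrapper: returns the canonical name of the guessed
# encoding, or undef if Encode::Guess found nothing or could not decide.
sub guess_among {
    my ($bytes, @suspects) = @_;
    my $decoder = guess_encoding($bytes, @suspects);
    # A non-reference return value is an error message, not an encoding.
    return ref $decoder ? $decoder->name : undef;
}
```

For Japanese text you might call it as guess_among($bytes, qw/euc-jp shiftjis 7bit-jis/). Keep in mind that the module is not good at telling single-byte encodings apart: if you list several Latin or Cyrillic code pages as suspects, it will usually report that the guess is ambiguous, because almost any byte string decodes cleanly in all of them.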

You could probably rely more heavily on Encode::Guess for more of the scenarios, in order to reduce the manual effort. But there are bound to be cases where you really just need to have a human involved (ideally one who knows the language being used in the text).

Bigram statistics for each "language/encoding" tuple serve well as a discriminator, but this depends on having reliable training data for each tuple. If you happen to be dealing with a closed set of possible input types, and just need an automatic way to differentiate between them, you only need a few hundred KB of text per language/encoding tuple to get fairly distinctive bigram statistics.

In effect, in languages that use single-byte encodings, pair-wise byte sequences fall into fairly predictable rankings in terms of frequency of occurrence, and the rankings are distinct from one language to the next. Extending this to CJK would involve a larger quantity of training data, and/or doing statistics on 4-byte sequences (i.e. pairings of 16-bit characters).
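To make the bigram idea concrete, here is a toy sketch for the single-byte case. Everything in it is an assumption for illustration: the sub names are mine, and I compare raw bigram counts with cosine similarity because it is short to write, which is not necessarily the scoring you'd want in production (a rank-based comparison would be closer to the "rankings" described above).

```perl
use strict;
use warnings;

# Count overlapping 2-byte sequences in a byte string.
sub bigram_counts {
    my ($bytes) = @_;
    my %count;
    $count{ substr $bytes, $_, 2 }++ for 0 .. length($bytes) - 2;
    return \%count;
}

# Cosine similarity between two bigram count tables (0 = nothing shared).
sub cosine {
    my ($p, $q) = @_;
    my ($dot, $pp, $qq) = (0, 0, 0);
    $pp  += $_ ** 2 for values %$p;
    $qq  += $_ ** 2 for values %$q;
    $dot += $p->{$_} * ($q->{$_} // 0) for keys %$p;
    return ($pp && $qq) ? $dot / sqrt($pp * $qq) : 0;
}

# Hypothetical helper: $profiles maps a "language/encoding" label to the
# bigram_counts() of its training text; returns the best-scoring label.
sub best_match {
    my ($sample, $profiles) = @_;
    my $sample_counts = bigram_counts($sample);
    my ($best, $best_score) = (undef, -1);
    for my $label (sort keys %$profiles) {
        my $score = cosine($sample_counts, $profiles->{$label});
        ($best, $best_score) = ($label, $score) if $score > $best_score;
    }
    return $best;
}
```

You would build each profile once from your training files (read as raw bytes) and then call best_match on the raw bytes of each unknown file.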


In reply to Re: How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use? by graff
in thread How do you open/read a text , without knowing its encoding, and remove any BOM if its utf, what do you use? by Anonymous Monk
