in reply to What's the best way to detect character encodings, Windows-1252 v. UTF-8?

I agree with bartmoritz. Due to some properties of UTF-8, it's very unlikely that cp1252-encoded text would be valid UTF-8*.

    use Encode qw( decode );

    my $bytes = '...';
    my $txt;
    if (!eval {
        $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1  # No exception
    }) {
        $txt = decode('Windows-1252', $bytes);
    }

* — Unless the encoded text contains no bytes above 0x7F, in which case it doesn't matter if you treat it as Windows-1252 or UTF-8.
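A minimal, self-contained sketch of that fallback as a helper sub (the sub name `decode_guess` and the sample strings are my own, not from the thread):

    use strict;
    use warnings;
    use Encode qw( decode );

    # Try UTF-8 first; if the bytes are malformed UTF-8, fall back
    # to Windows-1252, which can decode any byte sequence.
    sub decode_guess {
        my ($bytes) = @_;
        my $txt;
        if (!eval {
            $txt = decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
            1;  # no exception
        }) {
            $txt = decode('Windows-1252', $bytes);
        }
        return $txt;
    }

    # "café" encoded two ways:
    my $utf8_bytes   = "caf\xC3\xA9";  # valid UTF-8 for "café"
    my $cp1252_bytes = "caf\xE9";      # lone 0xE9 is malformed UTF-8

    # Both decode to the same four-character string "café":
    # the first via UTF-8, the second via the Windows-1252 fallback.
    my $a = decode_guess($utf8_bytes);
    my $b = decode_guess($cp1252_bytes);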

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8? (Areas of confusion)
by ikegami (Patriarch) on Jun 17, 2011 at 15:53 UTC

    That code would only guess wrong if all of the following are true:

    • The text is encoded using Windows-1252 (or iso-8859-1),
    • At least one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷] is present,
    • All instances of [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞß] are always followed by exactly one of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [àáâãäåæçèéêëìíîï] are always followed by exactly two of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • All instances of [ðñòóôõö÷] are always followed by exactly three of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
    • None of [øùúûüýþÿ] are present, and
    • None of [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿] are present except where previously mentioned.

    In other words, that code is very reliable.
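    For concreteness, here is the smallest pathological case I can construct (my own example, not from the thread): cp1252 "Ã©" is the byte pair 0xC3 0xA9, which is also a valid UTF-8 encoding of "é", so the UTF-8 attempt succeeds and the fallback never fires.

        use strict;
        use warnings;
        use Encode qw( decode );

        # 0xC3 ("Ã" in cp1252) followed by 0xA9 ("©" in cp1252)
        # happens to be a well-formed UTF-8 sequence for U+00E9 ("é").
        my $bytes = "\xC3\xA9";
        my $txt = eval {
            decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        };
        # $txt is defined and equals "é": the guess picks UTF-8,
        # which is wrong if the author really meant cp1252 "Ã©".

    Real cp1252 text almost never pairs up its high bytes this way, which is why the heuristic works so well in practice.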

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 16:10 UTC
    my $bytes = '...';

    How do I ensure that $bytes are bytes, not characters? I'm on Microsoft Windows and the text files are in the DOS format (i.e., CR-LF newlines). In other words, what I/O layer must I use? '<:raw'?

    Jim

      open(my $fh, '<:raw:perlio', $qfn)

      and

      open(my $fh, '<', $qfn);
      binmode($fh);

      would do, but then you'd have to do the CRLF translation yourself.

      open(my $fh, '<', $qfn)

      will actually work and properly do the CRLF translation (unless you set some default layers somewhere), despite decoding and CRLF translation being done in the wrong order. Note that

      open(my $fh, '<:encoding(UTF-8)', $qfn)

      also decodes and does CRLF translation in the wrong order. That's why

      open(my $fh, '<:encoding(UTF-16le)', $qfn)

      doesn't work on Windows (of all places!).
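      One way to sidestep the ordering problem entirely is to read raw bytes and do both steps by hand, decoding first and translating CRLF second. A sketch under that approach (`read_text` is my own helper name, not from the thread):

          use strict;
          use warnings;
          use Encode qw( decode );

          # Read a file's raw bytes, decode them, then translate
          # CRLF to LF — i.e. the two steps in the right order.
          sub read_text {
              my ($qfn, $enc) = @_;
              open(my $fh, '<:raw', $qfn)
                  or die "Can't open $qfn: $!";
              my $bytes = do { local $/; <$fh> };
              close($fh);
              my $txt = decode($enc, $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
              $txt =~ s/\r\n/\n/g;   # CRLF translation after decoding
              return $txt;
          }

      Because the translation happens on decoded characters, this works even for UTF-16le input, where the CR and LF code units are not single bytes.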

        So I think you're saying I should do the simplest thing and just open the files without specifying any I/O layer. In this case, Perl will do what I want. It will slurp the bytes of the file into a variable that it understands contains bytes, not characters, and it will also do what I want it to do with newlines, which is effectively to pass them through unmolested.

        What does '<:raw:perlio' do, exactly?

        Jim

Re^2: What's the best way to detect character encodings, Windows-1252 v. UTF-8?
by Jim (Curate) on Jun 17, 2011 at 15:56 UTC

    Thank you very much, ikegami.

    Unless it's valid US-ASCII, in which case it doesn't matter if you use Windows-1252 or UTF-8.

    Yep. Any purely ASCII text files will simply get a UTF-8 byte order mark prefixed to them, forcing them into Unicode goodness.

    EBCDIC text files will be blown to smithereens. In the context of what I'm doing, I don't care.

    Jim

      • A purely US-ASCII text file cannot contain a Unicode BOM.
      • BOMs don't force Unicode goodness, whatever that means.
      • I don't know why you bring up EBCDIC. You said only Windows-1252 and UTF-8 are possible.

      I changed the wording of the text you quoted in the hopes of being clearer.

        Uh, I was writing whimsically and lightheartedly. (My goodness, you can find fault and contention in the most innocuous and innocent places, ikegami.)

        I know an ASCII text file cannot contain a Unicode BOM. The whole point of what I'm doing is to convert all the text files to Unicode if they aren't Unicode already. A purely ASCII text file is also a Unicode text file, just as it is also a text file in almost all other character encodings (but not EBCDIC, for example). So I'm going to add a BOM to all purely ASCII text files to make them not purely ASCII text files anymore. I'm doing this because, for better or worse, the world is now full of software that requires Unicode and is insistent that the Unicode-ness be unequivocal (i.e., that the text includes a BOM).
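        For what it's worth, the BOM-prepending step can be done in place with a few lines of raw I/O. A sketch, assuming files small enough to slurp (`add_utf8_bom` is my own helper name):

            use strict;
            use warnings;

            # Prepend a UTF-8 BOM (EF BB BF) to a file that lacks one.
            sub add_utf8_bom {
                my ($qfn) = @_;
                open(my $in, '<:raw', $qfn) or die "Can't read $qfn: $!";
                my $bytes = do { local $/; <$in> };
                close($in);
                return if $bytes =~ /^\xEF\xBB\xBF/;   # already has a BOM
                open(my $out, '>:raw', $qfn) or die "Can't write $qfn: $!";
                print $out "\xEF\xBB\xBF", $bytes;
                close($out);
            }

        The early return makes it safe to run over the same tree twice without doubling up BOMs.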

        I mentioned EBCDIC as a lark. Smile, would ya! :-)

        Thank you again for your help.

        Jim