wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I have the faintest of grips on encodings. Can anyone spot what I'm doing wrong here?
#!/usr/bin/perl
use warnings;
use strict;
use Encode;
use HTML::Entities;

my $txt = q{“£”}; # from a windows text file
my $utf8 = decode(q{cp1250}, $txt);
print encode_entities($utf8); # “Ł” (an L with a stroke)

Replies are listed 'Best First'.
Re: win32 txt (with a £) -> decode -> encode_entities -> L with stroke
by moritz (Cardinal) on Jan 15, 2009 at 15:51 UTC
    Chances are that the text file is not in cp1250, as you think, or that an IO layer changed the encoding during the read process.

    When you open that text file with a hex editor, what are the bytes (or the byte) corresponding to the £?

    (If you have a Linux system available, hexdump -C is very helpful).
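Without a hex editor at hand, the same check can be done from Perl. This is only a minimal sketch: `hex_bytes` is a made-up helper name, and the temporary file with the bytes 93 A3 94 is an assumed stand-in for the real Windows text file. The `:raw` layer matters here, since it keeps any IO layer from altering the bytes before they are inspected.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Return the bytes of a file as space-separated hex pairs.
# (hex_bytes is a hypothetical helper, not part of any module.)
sub hex_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!";
    local $/;                         # slurp mode
    my $bytes = <$fh>;
    close($fh);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Assumed stand-in for the Windows text file: curly quote, A3, curly quote.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\x93\xA3\x94";
close($tmp_fh);

print hex_bytes($tmp_path), "\n";
```

Pointing `hex_bytes` at the real file instead of the temporary one shows which byte actually stands for the £.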

    Update: HTML::Entities does handle decoded strings with high codepoints correctly:

    $ perl -MHTML::Entities=encode_entities -wle 'print encode_entities(chr hex "20AC")'
    &euro;

    Second update: wfsp /msg'ed me that the hexdump showed A3. So let's try to simulate this:

    $ perl -we 'print chr(hex "A3")'|perl -MEncode -MHTML::Entities=encode_entities -wle 'my $x = <>; print encode_entities(decode("cp1250", $x))'
    &#x141;

    So, no additional characters, just a &#x141;, which is the Unicode codepoint for capital L with stroke (i.e. the output is correct).

    So either the additional characters appear in the file and the output you got is actually correct, or there's an additional IO layer somewhere that you haven't told us about (probably because you don't know about it).
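One way to see whether a layer is involved is to ask the filehandle. The following is a rough sketch rather than a diagnosis of the original script: `first_char_via` is a hypothetical helper, and the one-byte temporary file is an assumed stand-in for the real input. It reads the same byte once through `:raw` and once through an `:encoding(cp1250)` layer, and lists the layers reported by `PerlIO::get_layers`.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Read a whole file through the given layer spec; return the code point
# of its first character and the list of layers active on the handle.
# (first_char_via is a hypothetical helper name.)
sub first_char_via {
    my ($path, $layers) = @_;
    open(my $fh, "<$layers", $path) or die "Can't open $path: $!";
    my @active = PerlIO::get_layers($fh);
    local $/;                         # slurp mode
    my $content = <$fh>;
    close($fh);
    return (ord($content), \@active);
}

# Assumed stand-in for the Windows text file: the single byte A3.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);
print {$tmp_fh} "\xA3";
close($tmp_fh);

for my $layers (':raw', ':encoding(cp1250)') {
    my ($cp, $active) = first_char_via($tmp_path, $layers);
    printf "%-18s first char U+%04X, layers: %s\n", $layers, $cp, "@$active";
}
```

If a handle already carries an encoding layer, the string it returns is decoded text, so calling decode on it again would be a second decode. The exact layer names listed depend on the platform.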

      ... &#x141, which is the Unicode codepoint for capital L with stroke, (ie the output is correct).

      I think wfsp's point is that the pound sign (A3) should remain &#xA3; — i.e. the Unicode codepoint U+00A3 (pound sign) vs. U+0141 (capital L with stroke).  IOW, I don't think the output is correct...

      Update: as ikegami points out, cp1250 does not correspond to Latin-1 (ISO 8859-1), as I was misled to assume (and maybe wfsp, too?) — the difference between cp1250 and cp1252 then of course explains the output...

        According to Wikipedia:

        • iso-8859-1's A3 codepoint is the pound sign (U+00A3).
        • cp1252's A3 codepoint is the pound sign (U+00A3) since it's based on iso-8859-1.
        • iso-8859-2's A3 codepoint is uppercase L with stroke (U+0141).
        • cp1250's A3 codepoint is uppercase L with stroke (U+0141) since it's based on iso-8859-2.

        If this information is accurate, Encode is producing the proper output and wfsp's expectations are wrong.

        use Encode qw( decode );

        for (qw( iso-8859-1 cp1252 iso-8859-2 cp1250 )) {
            printf( "%-11s U+%04X\n", "$_:", ord( decode($_, "\xA3") ) );
        }

        iso-8859-1: U+00A3
        cp1252:     U+00A3
        iso-8859-2: U+0141
        cp1250:     U+0141

        Update: Added to node.
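Applied to the original sample, the same comparison might look like the sketch below. It assumes the file holds the three bytes 93 A3 94 (curly quote, pound sign, curly quote as a Western European Windows editor would write them); only the A3 was confirmed by the hexdump, so the other two bytes are a guess. `codepoints` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Decode a byte string and list the resulting code points.
# (codepoints is a hypothetical helper, not an Encode function.)
sub codepoints {
    my ($encoding, $bytes) = @_;
    my $text = decode($encoding, $bytes);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# Assumed bytes of the sample text; only A3 was confirmed in the thread.
my $bytes = "\x93\xA3\x94";

printf "%-7s %s\n", "$_:", codepoints($_, $bytes) for qw( cp1250 cp1252 );
```

Both code pages map 93 and 94 to the same curly quotes, which would explain why only the pound sign looked wrong. If the file really is Western European Windows text, decoding it as cp1252 before calling encode_entities is probably what was wanted.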

      print decode(q{cp1250}, chr(0xA3));
      outputs the L with stroke. Could a US English mismatch cause this?