in reply to win32 txt (with a £) -> decode -> encode_entities -> L with stroke

Chances are that the text file is not in cp1250, as you think, or that an IO layer changed the encoding during the read process.

When you open that text file with a hex editor, what are the bytes (or the byte) corresponding to the £?

(If you have a Linux system available, hexdump -C is very helpful).
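If no hex editor or hexdump is at hand (as is often the case on Windows), a few lines of Perl can show the raw bytes instead. This is only a minimal sketch: the helper name hex_dump and the default file name input.txt are made up for illustration, and the file is opened with the :raw layer so that nothing gets translated on the way in.

```perl
use strict;
use warnings;

# Return a hex dump of a byte string, 16 bytes per row.
# (hex_dump is a hypothetical helper, not part of any module.)
sub hex_dump {
    my ($bytes) = @_;
    my @rows;
    for (my $off = 0; $off < length $bytes; $off += 16) {
        my $chunk = substr $bytes, $off, 16;
        my $hex   = join ' ', map { sprintf '%02X', ord } split //, $chunk;
        push @rows, sprintf '%08X  %s', $off, $hex;
    }
    return join "\n", @rows;
}

# 'input.txt' is a placeholder; pass the real file name as the first argument.
my $file = @ARGV ? $ARGV[0] : 'input.txt';
if (open my $fh, '<:raw', $file) {
    local $/;                  # slurp the whole file
    my $data = <$fh>;
    print hex_dump($data), "\n";
}
else {
    warn "Cannot open $file: $!\n";
}
```

A pound sign stored in a single-byte Windows code page shows up as the single byte A3, whereas a UTF-8 encoded pound sign shows up as the two bytes C2 A3.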

Update: HTML::Entities does handle decoded strings with high codepoints correctly:

$ perl -MHTML::Entities=encode_entities -wle 'print encode_entities(chr hex "20AC")'
&euro;

Second update: wfsp /msg'ed me that the hexdump showed A3. So let's try to simulate this:

$ perl -we 'print chr(hex "A3")' | perl -MEncode -MHTML::Entities=encode_entities -wle 'my $x = <>; print encode_entities(decode("cp1250", $x))'
&#x141;

So, no additional characters, just &#x141;, the numeric character reference for U+0141, capital L with stroke (i.e. the output is correct).

So either the additional characters are already present in the file, in which case the output you got is actually correct, or there's an additional IO layer somewhere that you haven't told us about (probably because you don't know about it).
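One way to look for such a hidden layer is to ask Perl which layers are attached to a handle. The following is a rough sketch using the core function PerlIO::get_layers; the helper name layers_of is invented here, and the in-memory handle only stands in for the real input file.

```perl
use strict;
use warnings;

# Return the PerlIO layers on a handle as one space-separated string.
# (layers_of is a hypothetical helper wrapping the core PerlIO::get_layers.)
sub layers_of {
    my ($fh) = @_;
    return join ' ', PerlIO::get_layers($fh);
}

print 'STDIN:  ', layers_of(\*STDIN),  "\n";
print 'STDOUT: ', layers_of(\*STDOUT), "\n";

# For comparison: an in-memory handle opened with an explicit encoding layer.
my $bytes = "\xA3";
open my $fh, '<:encoding(cp1250)', \$bytes or die "open failed: $!";
print 'cp1250 handle: ', layers_of($fh), "\n";
```

If a handle lists an encoding or utf8 layer that the script never asked for, then the open pragma, the PERLIO environment variable and the -C switch are the usual places to look.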

Re^2: win32 txt (with a £) -> decode -> encode_entities -> L with stroke
by almut (Canon) on Jan 15, 2009 at 17:39 UTC
    ... &#x141, which is the Unicode codepoint for capital L with stroke, (ie the output is correct).

    I think wfsp's point is that the pound sign (A3) should remain &#xA3; — i.e. the Unicode codepoint U+00A3 (pound sign) vs. U+0141 (capital L with stroke).  IOW, I don't think the output is correct...

    Update: as ikegami points out, cp1250 does not correspond to Latin-1 (ISO 8859-1), as I was misled to assume (and maybe wfsp, too?) — the difference between cp1250 and cp1252 then of course explains the output...

      According to Wikipedia:

      • iso-8859-1's A3 codepoint is the pound sign (U+00A3).
      • cp1252's A3 codepoint is the pound sign (U+00A3) since it's based on iso-8859-1.
      • iso-8859-2's A3 codepoint is uppercase L with stroke (U+0141).
      • cp1250's A3 codepoint is uppercase L with stroke (U+0141) since it's based on iso-8859-2.

      If this information is accurate, Encode is producing the proper output and wfsp's expectations are wrong.

      use strict;
      use warnings;
      use Encode qw( decode );

      # Decode the single byte A3 under each encoding and print its codepoint.
      for (qw( iso-8859-1 cp1252 iso-8859-2 cp1250 )) {
          printf( "%-11s U+%04X\n", "$_:", ord( decode($_, "\xA3") ) );
      }

      iso-8859-1: U+00A3
      cp1252:     U+00A3
      iso-8859-2: U+0141
      cp1250:     U+0141

      Update: Added to node.
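To tie this back to the original question, here is a small sketch that runs the same byte through the whole decode and encode_entities pipeline under both code pages. The helper name bytes_to_entities is made up for illustration, and it assumes the file really contains the single byte A3, as the hexdump reported.

```perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( encode_entities );

# Decode raw bytes from the given encoding, then entity-encode the result.
# (bytes_to_entities is a hypothetical helper name.)
sub bytes_to_entities {
    my ($encoding, $bytes) = @_;
    return encode_entities( decode($encoding, $bytes) );
}

for my $enc (qw( cp1252 cp1250 )) {
    printf "%-7s %s\n", "$enc:", bytes_to_entities($enc, "\xA3");
}
```

With cp1252 the byte comes out as &pound;, with cp1250 as &#x141;, so a file saved in the Western European Windows code page should be decoded as cp1252 to get the pound sign.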

        ...wfsp's expectations are wrong.
        Yup.

        Thanks for straightening out my muddle.

Re^2: win32 txt (with a £) -> decode -> encode_entities -> L with stroke
by wfsp (Abbot) on Jan 15, 2009 at 16:18 UTC
    print decode(q{cp1250}, chr(0xA3));
    outputs the L with stroke. Could a US English mismatch cause this?