win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke

wfsp has asked for the wisdom of the Perl Monks concerning the following question:

Re: win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke by moritz (Cardinal) on Jan 15, 2009 at 15:51 UTC
Chances are that the text file is not in cp1250, as you think, or that an IO layer changed the encoding during the read process. When you open that text file with a hex editor, what are the bytes (or the byte) corresponding to the Ł? (If you have a Linux system available, `hexdump -C` is very helpful.)

Update: HTML::Template does handle decoded strings with high codepoints correctly:

`$ perl -MHTML::Entities=encode_entities -wle 'print encode_entities(chr hex "20AC")'`
`&euro;`

Second update: wfsp /msg'ed me that the hexdump showed A3. So let's try to simulate this:

`$ perl -we 'print chr(hex "A3")' | perl -MEncode -MHTML::Entities=encode_entities -wle 'my $x = <>; print encode_entities(decode("cp1250", $x))'`
`&#x141;`

So, no additional characters, just a `&#x141;`, the numeric entity for the Unicode codepoint of capital L with stroke (i.e. the output is correct). So either the additional characters appear in the file, and the output that you got is actually correct, or there's an additional IO layer somewhere that you haven't told us about (probably because you don't know about it).
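For readers without a hex editor at hand, the byte check suggested above could also be done from Perl itself. The following is only a sketch: `hex_bytes_of_file` is a hypothetical helper name, and the file to inspect is assumed to be passed on the command line. It opens the file with the `:raw` layer so that no IO layer can alter the bytes before they are printed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bytes of a file as space-separated hex
# pairs. The ':raw' layer bypasses any encoding or CRLF translation, so the
# bytes are reported exactly as they are stored on disk.
sub hex_bytes_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp the whole file in one read
    my $bytes = <$fh>;
    close $fh;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Dump the file named on the command line, if one was given.
print hex_bytes_of_file($ARGV[0]), "\n" if @ARGV;
```

A file holding the single byte reported in the thread would show up as `A3`, independent of which encoding that byte is later decoded with.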
Re^2: win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke by almut (Canon) on Jan 15, 2009 at 17:39 UTC
"... `&#x141`, which is the Unicode codepoint for capital L with stroke, (ie the output is correct)."

I think wfsp's point is that the pound sign (`A3`) should remain `£`, i.e. the Unicode codepoint U+00A3 (pound sign) rather than U+0141 (capital L with stroke). ~~IOW, I don't think the output is correct...~~

Update: as ikegami points out, cp1250 does not correspond to Latin-1 (ISO 8859-1), as I was misled to assume (and maybe wfsp, too?). The difference between cp1250 and cp1252 then of course explains the output...
Re^3: win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke by ikegami (Patriarch) on Jan 15, 2009 at 17:56 UTC
According to Wikipedia:

iso-8859-1's A3 codepoint is the pound sign (U+00A3).
cp1252's A3 codepoint is the pound sign (U+00A3) since it's based on iso-8859-1.
iso-8859-2's A3 codepoint is uppercase L with stroke (U+0141).
cp1250's A3 codepoint is uppercase L with stroke (U+0141) since it's based on iso-8859-2.

If this information is accurate, Encode is producing the proper output and wfsp's expectations are wrong.

`use Encode qw( decode ); for (qw( iso-8859-1 cp1252 iso-8859-2 cp1250 )) { printf( "%-11s U+%04X\n", "$_:", ord( decode($_, "\xA3") ) ); }`

Output:

`iso-8859-1: U+00A3`
`cp1252:     U+00A3`
`iso-8859-2: U+0141`
`cp1250:     U+0141`

Update: Added to node.
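To tie this back to the original decode -> encode_entities pipeline, here is a minimal sketch that runs the same byte through both Windows code pages. It assumes the file was really written as cp1252, which the thread does not confirm, and `bytes_to_entities` is a hypothetical helper name rather than part of Encode or HTML::Entities.

```perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( encode_entities );

# Hypothetical helper: decode a byte string using the named encoding,
# then HTML-escape the resulting character string.
sub bytes_to_entities {
    my ($encoding, $bytes) = @_;
    return encode_entities( decode($encoding, $bytes) );
}

# Same input byte (A3), two different code pages.
for my $encoding (qw( cp1252 cp1250 )) {
    printf "%-7s %s\n", "$encoding:", bytes_to_entities($encoding, "\xA3");
}
```

With cp1252 the byte should come out as `&pound;`, while cp1250 gives the `&#x141;` seen earlier in the thread, so the name passed to `decode` decides which character you get.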
Re^4: win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke by wfsp (Abbot) on Jan 15, 2009 at 18:35 UTC
Re^2: win32 txt (with a Ł) -> decode -> encode_entities -> L with stroke by wfsp (Abbot) on Jan 15, 2009 at 16:18 UTC
`print decode(q{cp1250}, chr(0xA3));` outputs the L with stroke. Could a US English mismatch cause this?