omacneil has asked for the wisdom of the Perl Monks concerning the following question:

We have a database dump that is (mostly) in ISO-8859-1 or, more likely, Windows-1252. The database was populated by a web form that encouraged browsers to send us text in these charsets because its head section included:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

...and Internet Exploder interprets ISO-8859-1 as license to give you Windows-1252.

We are converting to UTF-8 because it is clearer and better supports non-English characters.

The somewhat simplified code is:

use Encode qw(from_to);

binmode DATA, ':utf8';
my $original  = <DATA>;
my $converted = $original;
from_to($converted, 'iso-8859-01', 'utf-8');
from_to($converted, 'utf-8', 'iso-8859-01');
print $converted eq $original ? 'round trip ok' : 'changed';
__DATA__
some chars that in reality aren't all low ascii

Our problem is that our UTF-8 output doesn't show up correctly in the terminal. As near as we can tell from the locale command and the Terminal->Set Character Encoding menu item in gnome-terminal, the terminal is set to UTF-8.

For example, some of our converted output in utf8 contains a bunch of 0xC2 and 0xC3 (194 & 195) chars

perl -e 'binmode STDOUT,":utf8"; print chr(0xC2),"\n";'

...Gives a LATIN CAPITAL A WITH CIRCUMFLEX (according to gnome-character-map), which is not in the input.

Maybe we don't know what charset the input is in?

UPDATE: set binmode per Anonymous Friend

Re: display of utf8
by moritz (Cardinal) on Aug 12, 2009 at 06:32 UTC

    For example, some of our converted output in utf8 contains a bunch of 0xC2 and 0xC3 (194 & 195) chars

    perl -e 'binmode STDOUT,":utf8"; print chr(0xC2),"\n";'

    ...Gives a LATIN CAPITAL A WITH CIRCUMFLEX (according to gnome-character-map), which is not in the input.

    UTF-8 is a multi-byte encoding. When you see a 0xC2 byte in the output, that's the start of a two-byte sequence. Two-byte sequences encode characters in the range U+0080-U+07FF; a 0xC2 lead byte specifically covers U+0080-U+00BF, and 0xC3 covers U+00C0-U+00FF. It does not mean that the character at codepoint U+00C2 should be displayed; that would only happen if your terminal were Latin-1 (or compatible).
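
To make the byte-versus-codepoint distinction concrete, here is a small sketch using the core Encode module. The helper names `utf8_hex` and `as_latin1_codepoints` are made up for illustration; they just show which bytes a codepoint becomes in UTF-8, and which characters a Latin-1 reader would see if it took those same bytes at face value.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: the UTF-8 bytes of one codepoint, as hex pairs.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Hypothetical helper: the codepoints a Latin-1 reader would see
# if it interpreted those UTF-8 bytes one byte per character.
sub as_latin1_codepoints {
    my ($codepoint) = @_;
    my $bytes   = encode('UTF-8', chr($codepoint));
    my $misread = decode('iso-8859-1', $bytes);
    return join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
}

printf "U+00E9 as UTF-8 bytes:  %s\n", utf8_hex(0xE9);
printf "U+00C2 as UTF-8 bytes:  %s\n", utf8_hex(0xC2);
printf "U+00E9 read as Latin-1: %s\n", as_latin1_codepoints(0xE9);
```

So a 0xC3 byte followed by 0xA9 is simply how U+00E9 is stored, and the one-liner that prints chr(0xC2) through a :utf8 layer emits the two bytes C3 82, which a UTF-8 terminal renders as the single character U+00C2.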

Re: display of utf8
by Anonymous Monk on Aug 12, 2009 at 04:53 UTC
    You should binmode DATA. You shouldn't rely on your terminal; use hexdump or od to look at the actual bytes.
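
If hexdump or od isn't handy, much the same check can be done from Perl itself. This is only a rough sketch: `hex_rows` is a hypothetical helper, and it assumes it is handed raw bytes (an encoded string, or data read from a filehandle with no decoding layer), not a decoded character string.

```perl
use strict;
use warnings;

# Hypothetical helper: format raw bytes as rows of up to 16 hex pairs,
# each row prefixed with its byte offset (loosely like hexdump output).
sub hex_rows {
    my ($bytes) = @_;
    my @rows;
    for (my $offset = 0; $offset < length($bytes); $offset += 16) {
        my $chunk = substr($bytes, $offset, 16);
        push @rows, sprintf('%08x  %s', $offset, join(' ', unpack('(H2)*', $chunk)));
    }
    return @rows;
}

# "cafe" with an e-acute, already encoded as UTF-8, plus a newline
print "$_\n" for hex_rows("caf\xC3\xA9\n");
```

Seeing c3 a9 in the dump tells you the data really is UTF-8, whatever the terminal chooses to draw.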
Re: display of utf8
by ikegami (Patriarch) on Aug 12, 2009 at 15:10 UTC
    use Encode qw( encode );

    binmode(DATA, ':encoding(UTF-8)');
    while (<DATA>) {
        print encode('iso-8859-1', $_, Encode::FB_HTMLCREF);
    }
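
For anyone wondering what the Encode::FB_HTMLCREF fallback does in the snippet above: characters that exist in ISO-8859-1 are encoded as single bytes, and anything that doesn't fit is replaced by a decimal HTML character reference. A minimal sketch, where `to_latin1_with_entities` is a hypothetical wrapper name and the sample string is invented:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrapper: encode decoded text to ISO-8859-1, turning
# unrepresentable characters into &#NNNN; references.
sub to_latin1_with_entities {
    my ($text) = @_;    # work on a copy so the caller's string is untouched
    return encode('iso-8859-1', $text, Encode::FB_HTMLCREF);
}

# e-acute fits in Latin-1; the curly quotes (U+201C, U+201D) do not
my $sample = "caf\x{E9} \x{201C}quoted\x{201D}";
print to_latin1_with_entities($sample), "\n";
```

This only makes sense when the output ends up in HTML; for plain-text output the references would show up literally.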
Re: display of utf8
by grantm (Parson) on Aug 14, 2009 at 00:19 UTC
    You might want to look at Encoding-FixLatin for dealing with data that might be ISO-8859-1 or CP-1252 or UTF-8.
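
The general idea behind handling such data can be sketched by hand with the core Encode module. To be clear, this is not that module's algorithm or API, just a simplified whole-string guess: try a strict UTF-8 decode first, and if the bytes aren't valid UTF-8, assume CP-1252. The name `decode_guess` is made up, and data that mixes encodings within one string would need something finer-grained.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode bytes as strict UTF-8 when they are valid
# UTF-8, otherwise fall back to CP-1252 (an assumption about the input).
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    return decode('cp1252', $bytes);
}

for my $bytes ("caf\xC3\xA9", "caf\xE9", "\x93hi\x94") {
    my $text = decode_guess($bytes);
    print join(' ', map { sprintf 'U+%04X', ord } split //, $text), "\n";
}
```

Pure ASCII decodes the same either way, and real CP-1252 text rarely happens to be valid UTF-8, which is why the UTF-8 attempt goes first.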