SheridanCat has asked for the wisdom of the Perl Monks concerning the following question:

I regularly get a batch of HTML containing information about Latin music. I'm having some problems making sure the non-English characters in that data don't get whacked. Perhaps pertinent, perhaps not, I'm running this on Windows XP with ActivePerl 5.8. This is a test example of the data:
<?xml version='1.0' encoding='ISO-8859-1'?>
<data>
  <title>Más Y Más</title>
  <artist>La Unión</artist>
</data>
Here's the code I'm testing:
#!/usr/bin/perl
use warnings;
use strict;
use XML::Simple;
use Data::Dumper;

undef $/;
open( FH, shift );
my $raw_file = <FH>;

my $xml = XMLin(
    $raw_file,
    forcearray    => [],
    suppressempty => undef
);

print Dumper ( $xml );
print $xml->{'artist'};
print "\n";
I pass the data file to the script and the output is this:
$VAR1 = {
          'artist' => "La Uni\x{f3}n",
          'title' => "M\x{e1}s Y M\x{e1}s"
        };
La Uni≤n
So, the dumped data at least has the hex equivalents of the non-English characters in it. The print, however, has done some conversion that I'm not sure about. In any event, this ends up as pretty much junk data. I'm sure there's a simple conversion I'm missing here. I've butted my head up against Unicode::String for a while, but the results were never satisfactory.
Any wisdom is very much appreciated.
Regards,
SheridanCat

Re: Character Conversion Conundrum
by Aristotle (Chancellor) on Dec 22, 2004 at 22:04 UTC

    Try this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Encode qw( is_utf8 );
    use XML::Simple;
    use Data::Dumper;

    my $raw_file = do { local $/; <> };

    my $xml = XMLin(
        $raw_file,
        forcearray    => [],
        suppressempty => undef,
    );

    print Dumper ( $xml );
    print is_utf8 $xml->{'artist'};
    print "\n";

    Does Perl say the UTF-8 flag is on? It should not, by the dump of your hash. Interestingly, XML::Simple converts to UTF-8 for me (and that's a good thing; maybe you should look into how to ask it to do so).

    The next question, then, is what encoding your terminal assumes. I have no idea at all how to find that out for a Windows box though… Apparently, it has a different opinion of what chr 0xF3 means than the one defined in ISO-8859-1.
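One way to test the terminal hypothesis is sketched below. It assumes the console uses code page 437, which is only a guess (running `chcp` at the Windows command prompt should report the real one). The helper names `byte_as_seen_by_console` and `for_console` are made up for this illustration. In that code page the byte 0xF3 is the "≤" glyph, which would match the output above, and encoding the string for the console before printing should give the intended "ó".

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Assumption: the console code page is 437; replace with whatever chcp reports.
my $console_cp = 'cp437';

# Which character does the console show for a given raw byte?
sub byte_as_seen_by_console {
    my ($byte) = @_;
    return decode( $console_cp, $byte );
}

# Convert a character string into the bytes the console expects.
sub for_console {
    my ($text) = @_;
    return encode( $console_cp, $text );
}

my $artist = "La Uni\x{f3}n";

# Report the code point the console associates with byte 0xF3.
printf "0xF3 in %s is U+%04X\n",
    $console_cp, ord( byte_as_seen_by_console("\xF3") );

# Print the artist name as bytes in the console's own encoding.
print for_console($artist), "\n";
```

If this prints the name correctly, pushing an encoding layer onto the handle with `binmode( STDOUT, ":encoding(cp437)" )` would achieve the same thing for every print, again assuming that code page.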

    Makeshifts last the longest.

      Interestingly, XML::Simple converts to UTF-8 for me
      IIRC XML is always supposed to contain Unicode data (i.e. a &#number; reference should be understood as a Unicode code point no matter what the file's encoding is), so converting to UTF-8 would appear to be a good thing in Perl, as Perl uses UTF-8 for Unicode. I would appreciate a pointer to a comprehensive (and clear) reference about XML(-parsers) and character encoding though. I'm just not 100% clear on the whole subject.

        You can always represent all of Unicode in an XML document using entities, but that is a separate issue from the encoding used by a particular XML document and whether and how it gets converted upon parsing. Your post sounds like you have a heap of flawed assumptions about encodings. (To be sure, most people do, I am not scolding you.) Please read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". The topic is complex and much harder to consume than first appearances suggest. I have it down fairly solidly at this point (after a good bit of work), and I still occasionally embarrass myself.

        Makeshifts last the longest.

      Thanks for the code. is_utf8 does return 1 in this case.

      Good question on the encoding my console uses. I'll see if I can find out.

      Thanks,

      SheridanCat

        Now that is weird. There's a 0xF3 in there, but the UTF-8 flag is on? 0xF3 0x6E is not a valid UTF-8 sequence. 0xF3 indicates the start of a four-byte sequence (four highest bits set, then a zero bit to terminate the length marker, and 3 bits of payload), but 0x6E is not part of a multi-byte sequence (its highest bit is zero). That's invalid.
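The byte-level argument above can be checked with a small sketch. The helper names `utf8_sequence_length` and `looks_like_utf8` are invented for this illustration. The first classifies a byte by its leading bits, and the second uses the core `utf8::decode` function on a copy to test whether a byte string is well-formed. For comparison, the last lines show the bytes that U+00F3 becomes when it really is encoded as UTF-8.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Length of the sequence announced by a lead byte:
# 1 for ASCII, 0 for a continuation byte, -1 for a byte UTF-8 never uses.
sub utf8_sequence_length {
    my ($byte) = @_;
    return 1 if ( $byte & 0x80 ) == 0x00;    # 0xxxxxxx
    return 0 if ( $byte & 0xC0 ) == 0x80;    # 10xxxxxx
    return 2 if ( $byte & 0xE0 ) == 0xC0;    # 110xxxxx
    return 3 if ( $byte & 0xF0 ) == 0xE0;    # 1110xxxx
    return 4 if ( $byte & 0xF8 ) == 0xF0;    # 11110xxx
    return -1;
}

# True if the byte string decodes as UTF-8; works on a copy of the input.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

printf "0xF3 announces a %d-byte sequence\n", utf8_sequence_length(0xF3);
printf "0x6E announces a %d-byte sequence\n", utf8_sequence_length(0x6E);
printf "F3 6E well-formed? %d\n", looks_like_utf8("\xF3\x6E");

# What U+00F3 looks like when it actually is UTF-8 encoded.
my $encoded = encode( 'UTF-8', "\x{f3}" );
print join( ' ', map { sprintf '%02X', ord } split //, $encoded ), "\n";
```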

        So the input never actually gets converted to UTF-8, but someone is still flipping the UTF-8 flag on it. And Perl does not complain when printing the string. Weird. Seems like something is rather amiss there. Whether that is the cause for the less-than-or-equal sign (≤) you're seeing on the console for some reason is anyone's guess. Assuming these are somewhat older versions of Perl and XML::Simple, maybe you ought to check whether newer ones act consistently.
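One possibility worth ruling out, offered as a sketch and not as a diagnosis of this setup: the UTF-8 flag describes how Perl stores a string internally, while `\x{f3}` in a dump may simply name the code point U+00F3. The example below uses a made-up helper, `describe`, and the core `utf8::upgrade` function to show that the same one-character string can exist with the flag off or on and still compare equal. The script does not depend on XML::Simple, so it says nothing about what that module does.

```perl
use strict;
use warnings;
use Encode qw( encode is_utf8 );

# Summarise a string: its code points, its UTF-8 flag, its length in characters.
sub describe {
    my ($s) = @_;
    my $points = join ' ', map { sprintf 'U+%04X', ord } split //, $s;
    return sprintf '%s flag=%d chars=%d', $points, ( is_utf8($s) ? 1 : 0 ), length $s;
}

my $plain   = "\xF3";       # one character, stored as the single byte 0xF3
my $flagged = $plain;
utf8::upgrade($flagged);    # same character, internal storage switched to UTF-8

print describe($plain),   "\n";
print describe($flagged), "\n";
print $plain eq $flagged ? "equal\n" : "different\n";

# Only an explicit encode turns the character into UTF-8 bytes for output.
my $bytes = encode( 'UTF-8', $flagged );
print join( ' ', map { sprintf '%02X', ord } split //, $bytes ), "\n";
```

If the flagged string behaves like this on the system in question, then the flag being on alongside `\x{f3}` would not by itself indicate corruption, and the console encoding would remain the main suspect.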

        I don't really have any suggestions, I'm afraid, I'm kind of at a loss.

        Makeshifts last the longest.