in reply to Character Conversion Conundrum

Try this:

#!/usr/bin/perl
use warnings;
use strict;

use Encode qw( is_utf8 );
use XML::Simple;
use Data::Dumper;

my $raw_file = do { local $/; <> };

my $xml = XMLin(
    $raw_file,
    forcearray    => [],
    suppressempty => undef,
);

print Dumper( $xml );
print is_utf8 $xml->{'artist'};
print "\n";

Does Perl say the UTF-8 flag is on? It should not, by the dump of your hash. Interestingly, XML::Simple converts to UTF-8 for me (and that's a good thing; maybe you should look into how to ask it to do so).

The next question, then, is what encoding your terminal assumes. I have no idea at all how to find that out for a Windows box though… Apparently, it has a different opinion of what chr 0xF3 means than the one defined in ISO-8859-1.
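
To see how much the terminal's assumption matters, here is a minimal sketch that decodes the single byte 0xF3 under a few encodings. The code page names cp437 and cp850 are only guesses at what a Windows console might be using (they do not come from the original post), and the helper codepoint_for is a made-up name for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: interpret one byte string under the given encoding
# and return the Unicode code point of the first resulting character.
sub codepoint_for {
    my ( $encoding, $octets ) = @_;    # copies, so the caller's data is untouched
    my $chars = decode( $encoding, $octets );
    return ord $chars;
}

# The candidate encodings are assumptions, not something the thread confirms.
for my $encoding (qw( iso-8859-1 cp437 cp850 )) {
    printf "%-10s 0xF3 -> U+%04X\n", $encoding, codepoint_for( $encoding, "\xF3" );
}
```

Under ISO-8859-1 the byte is U+00F3 (o with acute accent), while under CP437 it is U+2264 (less-than-or-equal sign). If the console really uses a DOS code page, that alone could account for a different glyph appearing on screen.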

Makeshifts last the longest.

Re^2: Character Conversion Conundrum
by Joost (Canon) on Dec 22, 2004 at 22:38 UTC
    Interestingly, XML::Simple converts to UTF-8 for me
    IIRC XML is always supposed to contain unicode data (i.e. a &#number; reference should be understood as a unicode code-point no matter what the file's encoding is), so converting to utf-8 would appear to be a good thing in perl, as perl uses utf-8 for unicode. I would appreciate a pointer to a comprehensive (and clear) reference about XML(-parsers) and character encoding though. I'm just not 100% clear on the whole subject.

      You can always represent all of Unicode in an XML document using entities, but that is a separate issue from the encoding used by a particular XML document and whether and how it gets converted upon parsing. Your post sounds like you have a heap of flawed assumptions about encodings. (To be sure, most people do; I am not scolding you.) Please read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". The topic is complex and much harder to digest than it first appears. I have it down fairly solidly at this point (after a good bit of work), and I still occasionally embarrass myself.

      Makeshifts last the longest.

        Thanks for the link. I'd already read it, and although it's really informative, it doesn't touch on the subject of the interaction between XML files and unicode at all.

        Let's give an example of the kind of thing I'm worried about:

        <?xml version="1.0" encoding="ISO-8859-1"?>
        <something>
          &#1040;
        </something>

        Now, what is an XML parser supposed to do with this? My understanding is that it should always return "unicode" data in some way - how else can it interpret the &#1040; character? But so far I have been unable to find any official reference on this situation. Maybe I'm just being stupid, or I've looked in the wrong places, but I can't find any "official" backup for or against my intuition.
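
The two layers in that example can be pulled apart with a small hand-rolled illustration. This is not an XML parser, only a sketch under stated assumptions: the sample bytes (including the extra 0xF3 byte) and the regex-based handling of the reference are invented for demonstration. It relies on the fact that numeric character references in XML name Unicode code points, whatever encoding the document declares.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Bytes as they might sit in an ISO-8859-1 encoded file (made-up sample):
# 0xF3 is a literal o-acute byte, &#1040; is a numeric character reference.
my $bytes = "<something>\xF3 &#1040;</something>";

# Layer 1: the declared encoding only says how to turn bytes into characters.
my $chars = decode( 'iso-8859-1', $bytes );

# Layer 2: a numeric reference names a Unicode code point directly,
# independent of the encoding used in layer 1.
( my $resolved = $chars ) =~ s/&#(\d+);/chr $1/ge;

# Crude extraction of the element content, good enough for this sample only.
my ($text) = $resolved =~ m{<something>(.*)</something>}s;

printf "U+%04X\n", ord($_) for split //, $text;
```

The result is a character string holding U+00F3, a space, and U+0410. The second of those cannot be expressed in ISO-8859-1 at all, which is why a parser has to hand back Unicode characters rather than bytes in the document's own encoding.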

Re^2: Character Conversion Conundrum
by SheridanCat (Pilgrim) on Dec 23, 2004 at 17:28 UTC
    Thanks for the code. is_utf8 does return 1 in this case.

    Good question on the encoding my console uses. I'll see if I can find out.

    Thanks,

    SheridanCat

      Now that is weird. There's a 0xF3 in there, but the UTF-8 flag is on? 0xF3 0x6E is not a valid UTF-8 sequence. 0xF3 indicates the start of a four-byte wide character (four highest bits set, then a zero bit to terminate the length prefix, and 3 bits of payload), but 0x6E is not a continuation byte (its highest bit is zero), so it cannot be part of that sequence. That's invalid.
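
The bit-pattern argument can be checked mechanically. The following is a minimal sketch that prints both bytes in binary and asks Encode's strict UTF-8 decoder whether the pair is well-formed; the helper name is_valid_utf8 is invented for this example.

```perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK LEAVE_SRC );

# Hypothetical helper: true if the octets form well-formed UTF-8.
# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# the input string from being modified during the check.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode( 'UTF-8', $octets, FB_CROAK | LEAVE_SRC ); 1 } ? 1 : 0;
}

# 0xF3 is a four-byte lead byte (11110xxx); 0x6E is plain ASCII (0xxxxxxx).
printf "0x%02X = %08b\n", $_, $_ for 0xF3, 0x6E;

print "F3 6E: ", ( is_valid_utf8("\xF3\x6E") ? "valid" : "invalid" ), "\n";
print "C3 B3: ", ( is_valid_utf8("\xC3\xB3") ? "valid" : "invalid" ), "\n";
```

For comparison, C3 B3 is what U+00F3 looks like when it actually has been encoded as UTF-8.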

      So the input never actually gets converted to UTF-8, but someone is still flipping the UTF-8 flag on it. And Perl does not complain when printing the string. Weird. Seems like something is rather amiss there. Whether that is the cause for the less-than character you're seeing on the console for some reason is anyone's guess. Assuming these are somewhat older versions of Perl and XML::Simple, maybe you ought to check whether newer ones act consistently.
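
One reading that might be worth ruling out, assuming the 0xF3 in the dump is a character code point rather than a raw byte: the UTF-8 flag only describes how Perl stores a string internally, so a string can report code point 0xF3 and still have the flag on. The sketch below uses only core functions; the variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw( is_utf8 encode );

my $char = "\xF3";       # one character, code point U+00F3
utf8::upgrade($char);    # change internal storage to UTF-8; the character stays the same

my $flag = is_utf8($char) ? 1 : 0;
my $ord  = ord $char;

# What the character looks like once it is explicitly encoded for output.
my $octets = join ' ', map { sprintf '%02X', ord } split //, encode( 'UTF-8', $char );

printf "flag=%d ord=0x%02X octets=%s\n", $flag, $ord, $octets;
```

If that is what is happening here, the string itself would be fine and the remaining suspect would be how it is encoded on its way to the console.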

      I don't really have any suggestions, I'm afraid, I'm kind of at a loss.

      Makeshifts last the longest.

        Thanks for the help nonetheless. I haven't figured out the problem yet. Just to make this complete, I'm using ActivePerl 5.8.3 and XML::Simple is 2.09, so I seem to be up-to-date.

        Very strange. I'll keep plugging away.

        Thanks

        SheridanCat