Character set cleanup (or something like that)...

devnul has asked for the wisdom of the Perl Monks concerning the following question:

I'm having a heckuva time parsing some XML.. :-(

Problem #1: Some of the XML contains characters which look like '\xb7'... To me that "sort-of" says ISO-8859-1

Problem #2: But that screws up other XML documents which I need to parse with the same script which have "&" instead of "&" in them..

Is there any "easy" way to clean this up and get it into UTF-8, UTF-16, or ISO-8859-1?..

.. I admit to not knowing a whole lot about these encoding issues, but hopefully this makes some sense... :-)

- Greg

Re: Character set cleanup (or something like that)... by bronto (Priest) on Apr 08, 2004 at 07:58 UTC
Yes: the Encode module, which is bundled with perl 5.8 (see the `from_to` function).

Ciao!
`--bronto`

The very nature of Perl to be like natural language--inconsistant and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway). --John M. Dlugosz
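For readers who want to see the `from_to` suggestion in action, here is a minimal sketch. It assumes the input really is ISO-8859-1, which the thread has not established, and the sample string is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Hypothetical sample: raw bytes, ASSUMED to be ISO-8859-1.
# \xe9 is e-acute and \xb7 is middle dot in that character set.
my $bytes = "caf\xe9 \xb7 bar";

# from_to() converts the byte string in place and returns the new
# length in bytes, or undef if the conversion failed.
my $len = from_to($bytes, 'iso-8859-1', 'utf-8');
die "conversion failed\n" unless defined $len;

# Show the resulting UTF-8 bytes as hex pairs.
print join(' ', unpack('(H2)*', $bytes)), "\n";
```

Note that `from_to` works on bytes in both directions, so the result is still a byte string (now UTF-8 encoded), not a Perl character string.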
Re: Character set cleanup (or something like that)... by devnul (Monk) on Apr 08, 2004 at 07:45 UTC
.. I should probably add that I am trying to parse an RSS feed, using the XML::RSS::Parser module (which uses XML::Parser (which uses expat (I think)..)..)...

- Greg
Re: Character set cleanup (or something like that)... by graff (Chancellor) on Apr 08, 2004 at 22:27 UTC
I'd be curious to know what you really meant to say for "problem #2", because when I read it, it said 'But that screws up other XML documents ... which have "&" instead of "&" in them', and I assume that you meant to say something else...

If you meant to say that there are entity references in some files and literal non-ASCII characters in others, you may want to look up the HTML::Entities module in order to convert the entity references to their corresponding literal characters in utf8. But without a better idea of what sort of data you're facing, it's hard to give suitable advice.

Determining whether or not "\xb7" indicates 8859-1 depends on the context. Do the surrounding characters make it plausible that "\xb7" is really being used as a "middle-dot" (e.g. as a "bullet-point" in an unordered list, or as punctuation within a numeric string)?

Even if the context does suggest that this is the correct interpretation for this code point, there still may be doubt about the particular character set you're dealing with: most of the other ISO-8859 pages have "middle-dot" for "\xb7", but differ in many other places. You need to have some additional evidence (possibly some external assurance from the data provider) to be certain how to interpret the non-ASCII bytes. Once you're sure about that, then use the Encode module's "decode" function to translate that into utf8.
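The last step (decoding once the character set is known) might look roughly like the sketch below. The sample data and the choice of ISO-8859-1 are assumptions for illustration only; swap in whichever encoding name the evidence supports.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: raw bytes in which 0xB7 is used as a middle dot.
my $raw = "3\xb714 apples \xb7 pears";

# ASSUMPTION: the data provider has confirmed ISO-8859-1.
# decode() turns the bytes into a Perl character string.
my $text = decode('iso-8859-1', $raw);

# Count occurrences of U+00B7 MIDDLE DOT in the decoded text.
my $dots = () = $text =~ /\x{b7}/g;

# Encode back to bytes as UTF-8 for output or for handing to a parser.
my $utf8 = encode('UTF-8', $text);

printf "%d characters, %d UTF-8 bytes, %d middle dots\n",
    length($text), length($utf8), $dots;
```

Unlike `from_to`, `decode` gives you a character string to work with, which makes it easier to inspect the context around a suspicious code point before committing to an output encoding.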