Re: Character set cleanup (or something like that)...

I'd be curious to know what you really meant to say for "problem #2", because when I read it, it said 'But that screws up other XML documents ... which have "&" instead of "&" in them', and I assume that you meant to say something else...

If you meant to say that there are entity references in some files and literal non-ASCII characters in others, you may want to look up the HTML::Entities module in order to convert the entity references to their corresponding literal characters in utf8. But without a better idea of what sort of data you're facing, it's hard to give suitable advice.
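As a rough sketch of that conversion (assuming HTML::Entities is installed from CPAN, and using a made-up sample string), something like this would do it. One caveat: if the files have to stay well-formed XML, blindly decoding "&amp;" and "&lt;" would break them, so this sketch only suits text that is being pulled out of the markup.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical sample text mixing named and numeric entity references.
my $text = 'caf&eacute; &amp; cr&#232;me &middot; done';

# In non-void context, decode_entities returns a decoded copy as a
# Perl character string and leaves $text untouched.
my $decoded = decode_entities($text);

# Encode to UTF-8 on the way out so the literal characters are written as utf8.
binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n";
```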

Determining whether or not "\xb7" indicates 8859-1 depends on the context. Do the surrounding characters make it plausible that "\xb7" is really being used as a "middle-dot" (e.g. as a "bullet-point" in an unordered list, or as punctuation within a numeric string)? Even if the context does suggest that this is the correct interpretation for this byte, there still may be doubt about the particular character set you're dealing with -- most of the other ISO-8859 character sets also have "middle-dot" for "\xb7", but they differ from 8859-1 in many other places. You need to have some additional evidence (possibly some external assurance from the data provider) to be certain how to interpret the non-ASCII bytes. Once you're sure about that, use the Encode module's "decode" function, naming that character set, to turn the bytes into Perl characters, which can then be written out as utf8.
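Here is a minimal sketch of that last step, assuming the data really is ISO-8859-1 (the sample string and the list of alternative encodings are made up for illustration). The loop at the end just shows that the same byte decodes to different code points depending on which character set you name, which is why the external evidence matters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw bytes in which \xb7 is assumed to be a middle dot.
my $bytes = "item one \xb7 item two";

# decode() interprets the bytes according to the named (assumed) encoding
# and returns a string of Perl characters.
my $chars = decode('iso-8859-1', $bytes);

# encode() produces the utf8 byte sequence, e.g. for writing to a file.
my $utf8 = encode('UTF-8', $chars);

# The same byte under other candidate encodings (an arbitrary selection).
for my $enc (qw(iso-8859-1 iso-8859-2 iso-8859-5 iso-8859-15)) {
    my $ch = decode($enc, "\xb7");
    printf "%-12s U+%04X\n", $enc, ord($ch);
}
```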
