I'd be curious to know what you really meant to say in "problem #2". As I read it, it said 'But that screws up other XML documents ... which have "&" instead of "&" in them', and I assume you meant to say something else...

If you meant to say that there are entity references in some files and literal non-ASCII characters in others, you may want to look up the HTML::Entities module in order to convert the entity references to their corresponding literal characters in utf8. But without a better idea of what sort of data you're facing, it's hard to give suitable advice.
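If entity references do turn out to be the issue, a minimal sketch of the HTML::Entities approach might look like the following. The sample string is made up for illustration, and the sketch assumes HTML::Entities (a CPAN module) is installed:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);

# Hypothetical sample text containing entity references.
my $text = 'caf&eacute; &middot; 5 &amp; 6';

# In scalar context, decode_entities returns a decoded copy as a
# Perl character string (not yet utf8 bytes).
my $decoded = decode_entities($text);

# Encode explicitly when utf8 bytes are needed for output.
my $utf8_bytes = encode('UTF-8', $decoded);
```

Be careful about running this over whole XML documents: it also decodes "&amp;amp;" and "&amp;lt;", which would leave the markup no longer well-formed, so it is probably safer to apply it to extracted text content only.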

Whether or not "\xb7" indicates 8859-1 depends on the context. Do the surrounding characters make it plausible that "\xb7" is really being used as a "middle-dot" (e.g. as a "bullet-point" in an unordered list, or as punctuation within a numeric string)? Even if the context does suggest that this is the correct interpretation for this byte, there may still be doubt about the particular character set you're dealing with -- most of the other ISO-8859 character sets also have "middle-dot" at "\xb7", but differ in many other places. You need some additional evidence (possibly some external assurance from the data provider) to be certain how to interpret the non-ASCII bytes. Once you're sure about that, use the Encode module's "decode" function to convert the bytes into Perl characters, which you can then write out as utf8.
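As a rough sketch of that last step, assuming you have already established that the data is ISO-8859-1 (the sample bytes and the is_valid_utf8 helper are hypothetical, not something from your data):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Hypothetical raw bytes, as they might be read from a file.
my $raw = "item \xb7 one";

# ASSUMPTION: the source encoding is ISO-8859-1. Substitute whatever
# the data provider confirms.
my $chars = decode('ISO-8859-1', $raw);    # bytes -> Perl characters
my $utf8  = encode('UTF-8', $chars);       # characters -> utf8 bytes
```

A lone "\xb7" byte is not well-formed UTF-8, so is_valid_utf8($raw) is false here. That only tells you the data is in some other encoding; it does not tell you which one.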


In reply to Re: Character set cleanup (or something like that)... by graff
in thread Character set cleanup (or something like that)... by devnul
