Thanks. In this case the files are actually emails that have been parsed over many years from many different ISPs. I doubt there is any uniformity in their original encodings (and no email headers are kept in the files, only the email bodies and some other relevant data), and I don't have the technical knowledge to know how best to deal with such a situation. That said, I'll review Perl encodings in the morning.

As far as encoding literal < ' & " > goes, I can only rely on the mail providers having done that properly to begin with; otherwise the situation is hopeless (i.e. I can't easily guess whether a given < is intended as an HTML start delimiter, an email quoting marker, or just someone pointing).
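For reference, the conservative fallback when you cannot tell markup from literal text is to escape everything. This is only a minimal sketch using core Perl; `escape_html` and `%entity` are hypothetical names, and it assumes the email body should be displayed as plain text, so any real HTML in the body would be shown as source rather than rendered.

```perl
use strict;
use warnings;

# Hypothetical lookup table: the five HTML-significant characters.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper: escape every special character in one pass,
# so an ampersand produced by an earlier replacement is never re-escaped.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

print escape_html(q{He said "a < b" & left}), "\n";
```

Note that running this on text that was already escaped would escape it a second time (`&lt;` becomes `&amp;lt;`), which is exactly the ambiguity described above.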

Update: Well, it seems this is the can of worms I feared to open. I admit it is all very much above my head in terms of technical understanding. This wouldn't be a major issue if I were paid to work on this problem, but I am a tinkerer. I just don't understand Perl and encodings well enough to fully grasp the problem, let alone the solution.

The server does return a UTF-8 charset which, after googling which character set Perl encodes in, also seems to be what Perl uses (Unicode, UTF-8). This may well be a problem I cannot tackle effectively, but hopefully some of the solutions here will work. Thanks.
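One common heuristic for bodies of unknown encoding is to try a strict UTF-8 decode first and fall back to a single-byte encoding if that fails. The sketch below uses the core Encode module; `decode_body` is a hypothetical name, and Windows-1252 as the fallback is purely an assumption about what older mail might contain, not something known about these files.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: strict UTF-8 first, then an assumed cp1252 fallback.
sub decode_body {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;      # the bytes were valid UTF-8
    return decode('cp1252', $bytes);    # assumption: legacy Windows text
}

my $utf8_bytes   = "caf\xC3\xA9";    # "café" stored as UTF-8
my $legacy_bytes = "caf\xE9";        # "café" stored as Latin-1/cp1252

for my $bytes ($utf8_bytes, $legacy_bytes) {
    my $text = decode_body($bytes);
    print encode('UTF-8', $text), "\n";    # re-encode for a UTF-8 page
}
```

Both inputs decode to the same character string, so the output can be written consistently as UTF-8. The fallback is a guess: text in some other legacy encoding would decode without error but come out with the wrong characters.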


In reply to Re^2: Removing Unsafe Characters by Praethen
in thread Removing Unsafe Characters by Praethen
