in reply to Removing Unsafe Characters

You need to make sure your files are properly encoded (e.g. UTF-8) and that the charset declared for the document matches that encoding.
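
A minimal sketch of that idea, in case it helps: decode the incoming bytes into Perl character strings, then encode exactly once on output, in the same charset the page declares. The `decode_bytes` helper and the cp1252 fallback are my assumptions for illustration; if the original encoding of a file is unknown, any fallback is a guess.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn raw bytes into a Perl character string.
# Try strict UTF-8 first; if the bytes are not valid UTF-8, fall back
# to cp1252 (an assumed default, not a detected encoding).
sub decode_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with a CHECK value may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $text if defined $text;
    return decode('cp1252', $bytes);
}

my $from_utf8 = decode_bytes("caf\xC3\xA9");        # valid UTF-8 bytes
my $from_1252 = decode_bytes("caf\xE9 au lait");    # invalid as UTF-8, read as cp1252

# Encode once, on output, to match a "charset=utf-8" declaration.
my $out = encode('UTF-8', $from_1252);
```

Both inputs end up as the same kind of character string internally, so the rest of the script does not need to care which encoding a given file started in.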

BTW, there are no unsafe characters in HTML, but literal < ' & " > do need to be encoded as entities where the context requires it :)
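
For text that should be displayed literally, the encoding could look roughly like this. The `escape_html` name and the entity table are mine, just a sketch; the CPAN module HTML::Entities provides `encode_entities` if you would rather not hand-roll it.

```perl
use strict;
use warnings;

# Map each special character to an HTML entity.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper: escape text meant to be shown literally.
# A single substitution pass means the '&' in an inserted entity
# is never escaped a second time.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}

my $escaped = escape_html(q{if a < b && c > "d" it's fine});
print "$escaped\n";
```

This only makes sense for content you know is plain text; running it over a body that already contains intended HTML markup would escape that markup too.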

Re^2: Removing Unsafe Characters
by Praethen (Scribe) on Apr 28, 2009 at 07:34 UTC

    Thanks. In this case the files are actually emails that have been parsed over many years from many different ISPs. I doubt there is any uniformity in their original encodings (and no email headers are kept in the files, only the email bodies and some other relevant data), and I don't have the technical knowledge to know how best to deal with such a situation. That said, I'll review Perl encodings in the morning.

    As for encoding literal < ' & " >, I can only rely on the mail providers having done that properly to begin with; otherwise the situation is hopeless (i.e. I can't easily guess whether a given < is intended as an HTML start delimiter, an email quoting marker, or just someone pointing).

    Update: Well, it seems this is the can of worms I feared to open. I admit it is all very much above my head in terms of technical understanding. This wouldn't be a major issue if I were paid to work on this problem, but I am a tinkerer. I just don't understand Perl and encodings well enough to fully grasp the problem, let alone the solution.

    The server does return a UTF-8 charset, which, after googling which character set Perl encodes in, seems to be the same thing: Unicode UTF-8. This may well be a problem I cannot tackle effectively, but hopefully some of the solutions here will work. Thanks.