in reply to Re: Removing Unsafe Characters
in thread Removing Unsafe Characters

Thanks. In this case the files are actually emails that have been parsed over many years from many different ISPs. I doubt there is any uniformity in their original encodings (no email headers are kept in the files, only the email bodies and some other relevant data), and I don't have the technical knowledge to know how best to deal with such a situation. That said, I'll review Perl encodings in the morning.
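
For bodies of unknown encoding with no headers to consult, one common approach is to try a strict UTF-8 decode first and fall back to a single-byte encoding if that fails. The following is only a sketch under assumptions: the helper name decode_guess is hypothetical, and the choice of cp1252 as the fallback is a guess about typical Western email, not something known about these files.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw bytes of unknown origin into Perl characters.
# Assumption: anything that is not valid UTF-8 is treated as cp1252.
sub decode_guess {
    my ($bytes) = @_;

    # Strict decode: dies on malformed UTF-8 instead of silently substituting.
    # LEAVE_SRC keeps $bytes intact so the fallback below can reuse it.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    # Fallback: cp1252 maps almost every byte to some character.
    return decode('cp1252', $bytes);
}
```

With this, both "caf\xC3\xA9" (UTF-8 bytes) and "caf\xE9" (Latin-1/cp1252 bytes) come back as the same four-character string. It cannot tell apart different single-byte encodings, so text from other code pages would still come out wrong.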

As far as encoding the literal characters < ' & " > goes, I can only rely on the mail providers having done that properly to begin with, or the situation is hopeless (i.e. I can't easily guess whether a given < is intended to be an HTML start delimiter, an email quoting marker, or just someone pointing).
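
If a body were known to be plain text rather than HTML, escaping those five characters is mechanical. A minimal sketch follows, with a hypothetical helper name escape_html; it assumes the input is not already escaped, since running it on text that already contains entities would double-escape them (&amp; would become &amp;amp;).

```perl
use strict;
use warnings;

# Map each unsafe character to its HTML entity.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper: escape plain text for inclusion in HTML.
# Assumption: $text is plain text, not markup that is already escaped.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$ENTITY{$1}/g;
    return $text;
}
```

This does nothing to answer which bodies are plain text in the first place; that still has to be decided some other way.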

update: Well, it seems this is the can of worms I feared to open. I admit it is all very much over my head in terms of technical understanding. This wouldn't be a major issue if I were paid to work on this problem, but I am a tinkerer. I just don't understand Perl and encodings well enough to fully grasp the problem, let alone the solution.

The server does return a UTF-8 charset, which, after googling which character set Perl encodes in, also seems to be what Perl uses (Unicode, UTF-8). This may well be a problem I cannot tackle effectively, but hopefully some of the solutions here will work. Thanks.
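
On the output side, a decoded Perl string holds characters, and it only becomes UTF-8 bytes when it is explicitly encoded or written through an encoding layer. A small sketch of that step, assuming the text has already been decoded; the helper name to_utf8_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a character string into UTF-8 bytes for output.
sub to_utf8_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $text  = "caf\x{E9}";            # 4 characters
my $bytes = to_utf8_bytes($text);   # 5 bytes, since U+00E9 takes two in UTF-8

# Alternative: let the filehandle do the encoding on every print.
# binmode(STDOUT, ':encoding(UTF-8)');
```

Either approach should match a server that declares a UTF-8 charset, as long as the same string is not encoded twice.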