Part 1: I tried $encoded = encode_entities($input, "\xA0-\x{FFFD}"); but sadly it didn't work.

I then began investigating the actual encoding used for the files. If I can figure that out, maybe I can work out how to convert them properly.

I don't have File::MMagic, as suggested in "How do I determine encoding format of a file ?", but I do have Encode::Guess. I got that running and immediately got an "Unknown encoding" error, exactly at the place where I have a garbage character. When I ran Encode::Guess on the data as a single string (instead of an array), I got "No appropriate encodings found!".
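For anyone following along, here is roughly how I understand Encode::Guess is meant to be called. This is a minimal sketch, not a fix: guess_name is a hypothetical helper and the sample bytes are made up. As far as I can tell from the docs, guess_encoding takes the raw octets as its first argument and treats every further argument as a suspect encoding name, which may be why handing it an array of lines produced "Unknown encoding". It returns an encoding object on success and a plain error string otherwise.

```perl
use strict;
use warnings;
use Encode::Guess;

# Hypothetical helper: returns (name, undef) when Encode::Guess settles on
# one encoding, or (undef, message) when it cannot decide.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref $guess ? ($guess->name, undef) : (undef, $guess);
}

# Assumed sample data: raw bytes that happen to be well-formed UTF-8.
my $sample = "caf\xC3\xA9 \xE2\x80\x93 bar";
my ($name, $err) = guess_name($sample);
print defined $name ? "guessed: $name\n" : "no guess: $err\n";

# Assumed sample data: the bytes 226 and 128 followed by a space, which is
# not valid UTF-8 and not ASCII either.
my $broken = "abc\xE2\x80 def";
($name, $err) = guess_name($broken);
print defined $name ? "guessed: $name\n" : "no guess: $err\n";
```

Extra suspects (for example 'cp1252' or 'iso-8859-1') can be passed after the data, but single-byte encodings accept almost any byte, so the answer tends to become ambiguous rather than clearer.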

I focused on this character, hoping it would give some clue about my problem. I used the ord() function to try to isolate it. The junk turns out to be two characters, with decimal values 226 and 128. The 226 is valid but 128 isn't. On top of all that, I'm positive that the user's intended character was a hyphen.
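A sketch of the kind of byte dump I mean, in case it helps someone spot the pattern. The helper name high_bytes and the sample line are assumptions for illustration. In particular, I am assuming a third byte (0x93) follows the 226 and 128, which I have not confirmed in my data. If that assumption holds, the three bytes E2 80 93 would be the UTF-8 encoding of U+2013 (an en dash), which would at least fit with the user meaning a hyphen.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list every non-ASCII byte as offset:decimal(hex).
sub high_bytes {
    my ($octets) = @_;
    my @found;
    while ($octets =~ /([\x80-\xFF])/g) {
        push @found, sprintf '%d:%d(0x%02X)', pos($octets) - 1, ord $1, ord $1;
    }
    return @found;
}

# Assumed sample: "a", then the bytes E2 80 93, then "b".
my $line = "a\xE2\x80\x93b";
print join(' ', high_bytes($line)), "\n";

# If the bytes really are UTF-8, decoding collapses all three into one character.
my $text = decode('UTF-8', $line);
printf "U+%04X\n", ord substr($text, 1, 1);
```

The point of the dump is that ord() on undecoded data reports bytes, not characters, so a single multi-byte character shows up as several "junk" values.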

I feel even more lost than when I started. None of the solutions provided works properly: I either get more junk characters or I get valid characters that shouldn't be there at all.

I think I'll give up on this question and try to chase down how to determine the character encoding of these files. The problem is that I have 40,000+ files. How many different encodings could there be? (I'm guessing a few.)
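One way I might start on the survey is a rough tally: sort each file into plain ASCII, well-formed UTF-8, or "other". This is only a sketch under assumptions. classify_octets is a hypothetical helper, the file list is taken from the command line, and "other" covers every single-byte encoding without telling them apart, so it narrows the problem rather than solving it.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical classifier: 'ascii', 'utf-8', or 'other' for a string of raw bytes.
sub classify_octets {
    my ($octets) = @_;
    return 'ascii' unless $octets =~ /[\x80-\xFF]/;
    # A strict decode dies on malformed input; eval turns that into a false value.
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf-8' : 'other';
}

# Tally the files named on the command line.
my %count;
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or do { warn "$path: $!\n"; next };
    my $octets = do { local $/; <$fh> } // '';
    close $fh;
    $count{ classify_octets($octets) }++;
}
printf "%-6s %d\n", $_, $count{$_} for sort keys %count;
```

Run it as something like `perl tally.pl *.txt` (the script name is made up). Files that land in "other" are the ones that would still need a closer look at the individual bytes.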