HalNineThousand has asked for the wisdom of the Perl Monks concerning the following question:

I'm not sure what wording to use to describe this and I know I've read about issues with text conversion, when dealing with different forms, like UTF, ASCII, and so on. I can't remember them now, though. I think if I could, I'd find the answer to this quickly.

I'm trying to convert a file with text from 1878 (in other words, no copyright issues or anything like that) from HTML to text by pulling the text out, reindexing, and so on. I use regexes to pull the code out. I essentially read the entire file into one string, then break it into pages of text to work with.

The problem is the original file has characters in it that are changed. I don't do anything in the regex work I'm doing to change these characters. For example, one word (inlets) has forward and backward quote marks around it, but after reading the file into the string, pulling out the text, and saving it, instead of forward and backward quotes, I get <E2><80><9C>inlets<E2><80><9D>.

What's happening here? Is Perl trying to read a UTF file as ASCII or something similar? I'm not using the affected characters in any regexes, so I don't see how they can be changed from characters to codes. What do I need to do so the output is in the same kind of text as the input?

Thank you for any help on this or telling me what the overall issue or topic is that concerns this.

Re: Characters Changed To Codes
by Corion (Patriarch) on Jan 23, 2011 at 20:02 UTC

    The overall issue is "Encodings". You need to find out what encoding your input is in, and what encoding your output is in (and what the program you're using to display your output thinks the output is encoded in), and make sure that you properly convert between these, or send the appropriate headers etc. to tell all programs involved about the encoding.

    For Perl, the best general approach is to decode your input into Perl's internal character strings with Encode::decode, and to encode to the target encoding on output with Encode::encode. Ideally, your target encoding is UTF-8. For example, if you're outputting HTML, you can tell the browser the encoding with a <meta> charset declaration in the document's <head>, or with the HTTP Content-Type header.
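    The decode-on-input, encode-on-output pattern can be sketched roughly as follows. This is only an illustration: it assumes the input is UTF-8 (which the E2 80 9C / E2 80 9D bytes in the question suggest), and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes as they would arrive from the file (assumed to be UTF-8).
# E2 80 9C and E2 80 9D are the UTF-8 forms of U+201C and U+201D,
# the left and right curly double quotes.
my $bytes = "\xE2\x80\x9Cinlets\xE2\x80\x9D";

# Decode on input: the byte string becomes a string of characters.
my $text = decode('UTF-8', $bytes);

# Each curly quote is three bytes but only one character.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# Encode on output: characters become bytes in the target encoding.
my $out = encode('UTF-8', $text);
print "round trip preserved the bytes\n" if $out eq $bytes;
```

    When reading from and writing to files, the same conversion can be done by opening the handles with an ':encoding(UTF-8)' layer instead of calling decode and encode by hand.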

      I had been viewing the input and output files in different viewers and had thought they were reliable. It turns out that was a big mistake (okay, it's obvious now!). I pulled up a hex editor, looked over the bytes, and found that the encoding was not getting messed up; the problem was that I was not specifying the encoding in the output file.
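      The same check can be done from Perl instead of a hex editor. The sketch below is only an illustration: hex_bytes is a hypothetical helper name, and the sample string stands in for bytes read from a file opened without any encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex pairs,
# so the raw contents can be compared without trusting a viewer.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Stand-in for raw bytes read from the input or output file.
my $sample = "\xE2\x80\x9Cinlets\xE2\x80\x9D";

print hex_bytes($sample), "\n";
# prints: E2 80 9C 69 6E 6C 65 74 73 E2 80 9D
```

      If the dump of the input and the dump of the output match, the data survived intact and any difference on screen comes from how each viewer guesses the encoding.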

      In the past I've only used HTML with my own sites or in specific usage situations, so encoding has never been an issue for me for anything -- obviously, otherwise I would have known the term.

      I see there's a LOT of info out there on encoding, so thanks for suggesting the obvious (that I had overlooked) and for giving me the right term to use for researching this. That's a BIG help!