HalNineThousand has asked for the wisdom of the Perl Monks concerning the following question:
I'm not sure what wording to use to describe this. I know I've read about issues with text conversion between different encodings, like UTF and ASCII, but I can't remember the details now. If I could, I think I'd find the answer quickly.
I'm trying to convert a file with text from 1878 (in other words, no copyright issues or anything like that) from HTML to text by pulling the text out, reindexing, and so on. I use regexes to pull the code out. I essentially read the entire file into one string, then break it into pages of text to work with.
The problem is that the original file has characters in it that come out changed. Nothing in my regex work touches these characters. For example, one word (inlets) has opening and closing quote marks around it, but after reading the file into the string, pulling out the text, and saving it, instead of the quote marks I get <E2><80><9C>inlets<E2><80><9D>.
What's happening here? Is Perl trying to read a UTF file as ASCII or something similar? I'm not using the affected characters in any regexes, so I don't see how they can be changed from characters to codes. What do I need to do so the output is in the same kind of text as the input?
Thank you for any help on this, or for telling me what the overall issue or topic is called.
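To make the symptom concrete, here is a minimal sketch, assuming the codes shown above are the raw UTF-8 bytes of the curly quotes U+201C and U+201D (they match that encoding, but the input file's actual encoding is not stated in the post). It decodes those bytes with the core `Encode` module, then shows one possible way to read and write a file with explicit `:encoding(UTF-8)` layers. The helper name `copy_utf8` and its path arguments are hypothetical, not something from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The output from the post, written as raw octets (12 bytes).
my $bytes = "\xE2\x80\x9Cinlets\xE2\x80\x9D";

# Decoded as UTF-8, the same data is 8 characters:
# two curly quotes plus the six letters of "inlets".
my $text = decode('UTF-8', $bytes);
printf "%d bytes, %d characters\n", length($bytes), length($text);
printf "first: U+%04X, last: U+%04X\n",
    ord(substr($text, 0, 1)), ord(substr($text, -1));

# Hypothetical helper (an assumption, not the poster's code): slurp a file
# as UTF-8 text and write it back out as UTF-8.
sub copy_utf8 {
    my ($in_path, $out_path) = @_;

    open my $in, '<:encoding(UTF-8)', $in_path
        or die "Cannot read $in_path: $!";
    my $content = do { local $/; <$in> };   # read the whole file at once
    close $in;

    # Any regex work on $content would go here, operating on characters.

    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";
    print {$out} $content;
    close $out or die "Cannot close $out_path: $!";

    return length $content;                 # length in characters
}
```

Run as is, the script prints `12 bytes, 8 characters` and `first: U+201C, last: U+201D`. If the saved file contains exactly those bytes, the data may be intact and only the viewer is showing the bytes as codes instead of rendering them as UTF-8; that is a guess, since the post does not say which tool displayed the output.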
Re: Characters Changed To Codes
by Corion (Patriarch) on Jan 23, 2011 at 20:02 UTC
by HalNineThousand (Beadle) on Jan 23, 2011 at 20:20 UTC