I'm not sure what wording to use to describe this. I know I've read about issues with text conversion between different encodings, like UTF, ASCII, and so on, but I can't remember the details now. If I could, I think I'd find the answer quickly.

I'm trying to convert a file containing text from 1878 (in other words, no copyright issues or anything like that) from HTML to plain text by pulling the text out, reindexing, and so on. I use regexes to pull the HTML code out. Essentially, I read the entire file into one string, then break it into pages of text to work with.
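The workflow described above can be sketched roughly as follows. This is a minimal sketch, not the actual script: it assumes the input is UTF-8 (the post does not confirm this), the temporary sample file and the one-line tag-stripping regex are hypothetical stand-ins for the real file and the real extraction, and the explicit `:encoding(UTF-8)` layers are one common way to tell Perl how to decode on read and encode on write.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real HTML file: a tiny UTF-8 sample.
my ($tmp_fh, $in_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':encoding(UTF-8)');
print {$tmp_fh} "<p>the \x{201C}inlets\x{201D} of the bay</p>\n";
close $tmp_fh;

# Read the whole file into one string, decoding UTF-8 on the way in.
open my $in, '<:encoding(UTF-8)', $in_path
    or die "Cannot open $in_path: $!";
my $html = do { local $/; <$in> };
close $in;

# Crude tag stripping, standing in for the real regex extraction.
(my $text = $html) =~ s/<[^>]+>//g;

# Write the result, encoding back to UTF-8 on the way out.
my $out_path = "$in_path.txt";
open my $out, '>:encoding(UTF-8)', $out_path
    or die "Cannot open $out_path: $!";
print {$out} $text;
close $out;
```

With the layers in place, `$text` holds characters rather than raw bytes, so each curly quote counts as one character in `length` and in regexes.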

The problem is that the original file contains characters that come out changed. Nothing in my regex work touches these characters. For example, one word (inlets) has opening and closing curly quote marks around it, but after reading the file into the string, pulling out the text, and saving it, instead of the curly quotes I get <E2><80><9C>inlets<E2><80><9D>.
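To make the symptom concrete: the byte sequences E2 80 9C and E2 80 9D are the UTF-8 encodings of the left and right double quotation marks (U+201C and U+201D). The sketch below, using the core `Encode` module, decodes the bytes shown above and prints the resulting code points. It assumes the data really is UTF-8. Whether the saved file itself changed, or the viewer is only showing the same bytes as hex codes, is not clear from the description.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes shown around "inlets" in the output, as a raw byte string.
my $bytes = "\xE2\x80\x9Cinlets\xE2\x80\x9D";

# Decoding as UTF-8 turns each three-byte run into a single character.
my $text = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($text);
printf "first: U+%04X, last: U+%04X\n",
    ord(substr($text, 0, 1)), ord(substr($text, -1));
```

Twelve bytes become eight characters here, which is the difference between treating the string as raw bytes and treating it as decoded text.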

What's happening here? Is Perl trying to read a UTF file as ASCII or something similar? I'm not using the affected characters in any regexes, so I don't see how they can be changed from characters to codes. What do I need to do so the output uses the same encoding as the input?

Thank you for any help with this, or for telling me what the overall issue or topic is called.


In reply to Characters Changed To Codes by HalNineThousand
