Part 1: $encoded = encode_entities($input, "\xA0-\x{FFFD}"); -- Sadly it didn't work.

I then began to investigate the actual encoding used for the files. Maybe if I can figure that out, then I can figure out how to properly convert them.

I don't have File::MMagic, as suggested at How do I determine encoding format of a file ?, but I do have Encode::Guess. I got that running and immediately got an "Unknown encoding" error exactly at the place where I have a garbage character. When running Encode::Guess on the data as a string (instead of an array), I got "No appropriate encodings found!"
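For reference, a minimal sketch of calling Encode::Guess on a single string of raw bytes. The sample data and the helper name guess_name are made up for illustration. As far as the module's documentation goes, any arguments after the first are taken as a list of suspect encoding names, which may be why passing an array of lines produced "Unknown encoding".

```perl
use strict;
use warnings;
use Encode::Guess;    # ships with the Encode distribution; exports guess_encoding()

# Hypothetical sample: ASCII text around the bytes E2 80 93 (a UTF-8 en dash).
my $bytes = "pages 10\xE2\x80\x9320\n";

# Hypothetical helper: returns (name, undef) on success or (undef, message)
# on failure. guess_encoding() returns an encoding object when it settles on
# one encoding and a plain error string otherwise.
sub guess_name {
    my ($octets, @suspects) = @_;
    my $guess = guess_encoding($octets, @suspects);
    return ref($guess) ? ($guess->name, undef) : (undef, $guess);
}

my ($name, $err) = guess_name($bytes);
print defined $name ? "Guessed: $name\n" : "Could not guess: $err\n";
```

With no extra suspects, only ASCII, UTF-8 and BOM-marked UTF-16/32 are tried, so data in a single-byte encoding such as Latin-1 or Windows-1252 would be expected to fail unless those names are passed as suspects.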

I focused in on this character; maybe it could give some clues as to my problem. I used the ord() function to try to isolate it. Two characters return junk, and their decimal values are 226 and 128. The 226 is valid but 128 isn't. To top it all off, I'm positive that the user's intended character was a hyphen.
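A sketch of the kind of ord() inspection described above. The sample line is hypothetical, and its third byte (0x93) is an assumption, since only 226 and 128 are reported here. Bytes 226 128 are 0xE2 0x80 in hex, which is how the UTF-8 encoding of every code point from U+2000 to U+203F begins; that range includes U+2010 (hyphen), U+2013 (en dash) and U+2014 (em dash), so a dash-like character stored as UTF-8 would fit what was seen.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample line; "\xE2\x80\x93" stands in for the junk in the data.
my $line = "pages 10\xE2\x80\x9320";

# List every byte above 127 as "decimal/hex".
sub high_bytes {
    my ($bytes) = @_;
    return map  { sprintf "%d/0x%02X", ord($_), ord($_) }
           grep { ord($_) > 127 } split //, $bytes;
}

print join(" ", high_bytes($line)), "\n";    # 226/0xE2 128/0x80 147/0x93

# If the three bytes really are UTF-8, they decode to a single character.
printf "U+%04X\n", ord decode('UTF-8', "\xE2\x80\x93");    # U+2013
```

If the byte following 226 128 in the real files turns out to be something else, the decoded code point will differ, but the same dump should show which one it is.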

I feel even more lost than when I started. None of the solutions provided work properly: I either get more junk characters or I get valid characters that shouldn't be there at all.

I think I'll give up on this question and try to chase down how to determine what the character encoding is on these files. The problem is that I have 40,000+ files; how many different encodings could there be? (I'm guessing a few.)
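One possible way to survey that many files, sketched under assumptions: it only separates pure ASCII, valid UTF-8 and everything else, and it cannot tell single-byte encodings such as Latin-1 and Windows-1252 apart. The helper name classify_bytes and the idea of passing the file names on the command line are illustrative, not something taken from this thread.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: classify raw bytes as 'ascii', 'utf8' or 'other'.
sub classify_bytes {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
    # Strict UTF-8 decode; FB_CROAK makes malformed input die,
    # LEAVE_SRC keeps decode() from modifying $bytes.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 'utf8' : 'other';
}

my %tally;
for my $path (@ARGV) {
    my $fh;
    unless (open $fh, '<:raw', $path) {
        warn "$path: $!\n";
        next;
    }
    local $/;                     # slurp the whole file
    my $bytes = <$fh>;
    close $fh;
    $tally{ classify_bytes(defined $bytes ? $bytes : '') }++;
}
printf "%-6s %d\n", $_, $tally{$_} for sort keys %tally;
```

Files that land in 'other' would still need a closer look, for example with Encode::Guess and an explicit list of suspect encodings.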


In reply to Re^2: Removing Unsafe Characters by Praethen
in thread Removing Unsafe Characters by Praethen
