I have a large collection of plain text files. I know each file is in one of exactly two character encodings: Windows-1252 or UTF-8. (Strictly speaking, some of them are plain ASCII, because they contain no bytes in the range \x80 through \xFF and therefore no characters in the range U+0080 through U+10FFFF.) The UTF-8 files do not have byte order marks in them. None of the files is too big to slurp entirely into memory.

My objective is to normalize the text files to UTF-8 with byte order marks in them. I need to promote the Windows-1252 files to UTF-8, and I simply need to add byte order marks to the files that are already UTF-8.

What's the best way to do this? It seems I need a simple, surefire way to identify which files have one or more UTF-8-encoded characters in them.

I considered using Encode::Guess, but rejected it because it seems hinky.
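
For what it's worth, here is a minimal sketch of the kind of test I have in mind, assuming a strict UTF-8 decode is a good enough discriminator: if the raw bytes decode cleanly as UTF-8 (which includes pure ASCII), treat the file as UTF-8; otherwise fall back to Windows-1252. The sub name normalize_to_utf8_bom is made up for illustration. Only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: takes the raw bytes of one file and returns
# UTF-8 bytes with a leading byte order mark.
sub normalize_to_utf8_bom {
    my ($bytes) = @_;

    # Strict decode: FB_CROAK dies on any malformed UTF-8 sequence.
    # LEAVE_SRC keeps $bytes intact so it can be reused below.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };

    # Not well-formed UTF-8, so by the stated premise it is Windows-1252.
    $text = decode('cp1252', $bytes) unless defined $text;

    # U+FEFF encodes to the three bytes EF BB BF.
    return encode('UTF-8', "\x{FEFF}" . $text);
}
```

The files would need to be read and written with the :raw layer so that the sub sees and returns bytes, not decoded characters. One caveat with this approach: a Windows-1252 file whose high bytes happen to form valid UTF-8 sequences would be misidentified as UTF-8, though that should be rare in real text.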

Jim

