Jim has asked for the wisdom of the Perl Monks concerning the following question:
Two years ago, I posted What's the best way to detect character encodings, Windows-1252 v. UTF-8? to SoPW. I got plenty of helpful answers to my question then. Now, I need to solve essentially the same problem again, but with UTF-16/UTF-16LE/UTF-16BE added to the mix.
Is there a Perl module that will automatically detect text files in these character encodings and normalize them to UTF-8 with byte order marks?
For my purposes, I can assume that text in a single-byte "legacy" encoding (i.e., not Unicode) consisting solely of characters in the ranges 01-7F and A0-FF is ISO-8859-1. If it has characters in the ranges 80-9F as well, it's Windows-1252. In other words, I can pretend there's no such thing as C1 control codes. (This is what all modern web browsers do, and it's what's specified in the draft HTML5 specification.)
UPDATE: I also want to know which of the lowest common denominator encodings each text file is in. For example, a file that consists solely of bytes in the range 01-7F is, for my purposes, ASCII. Sure, it's also ISO-8859-1, Windows-1252, UTF-8, and dozens of other encodings besides. But it's strictly in the ASCII character encoding, so that's what I want it to be identified as.
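
To make the heuristic concrete, here is a minimal sketch of the detection step in Perl, under the assumptions above: check for a BOM first, then fall back to the byte-range tests. The subroutine name guess_encoding_of_bytes() and the ordering of the checks are illustrative choices rather than any existing module's interface, and BOM-less UTF-16 is not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw(decode);

# Classify a string of raw octets as ASCII, UTF-8, UTF-16LE/BE,
# windows-1252 or ISO-8859-1 using the rules described above.
sub guess_encoding_of_bytes {
    my ($octets) = @_;

    # Byte order marks settle the Unicode cases immediately.
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;

    # Strictly 01-7F: report the lowest common denominator, ASCII.
    return 'ASCII' if $octets =~ /\A[\x01-\x7F]*\z/;

    # If the octets decode cleanly as UTF-8, call the file UTF-8.
    my $is_utf8 = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return 'UTF-8' if $is_utf8;

    # Otherwise it's a single-byte legacy encoding: any byte in the
    # C1 range 80-9F makes it Windows-1252, else ISO-8859-1.
    return 'windows-1252' if $octets =~ /[\x80-\x9F]/;
    return 'ISO-8859-1';
}

# Usage: slurp the file in raw mode so no I/O layer mangles the octets.
my $file = shift @ARGV;
open my $fh, '<:raw', $file or die "Can't open $file: $!\n";
my $octets = do { local $/; <$fh> };
close $fh;
print guess_encoding_of_bytes($octets), "\n";
```

Once a file's encoding is known, the normalization step would be a decode from that encoding followed by a re-encode to UTF-8 with a leading U+FEFF byte order mark.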
Replies are listed 'Best First'.
Re: What's the best way to detect character encodings? (Redux)
by jakeease (Friar) on Jun 10, 2013 at 08:11 UTC

Re: What's the best way to detect character encodings? (Redux)
by gnosti (Chaplain) on Jun 10, 2013 at 05:02 UTC
    by Jim (Curate) on Jun 10, 2013 at 05:12 UTC