Bottom line: will my approach of < 32 ascii or > 126 ascii work despite the actual encoding sent?
Not reliably. There are character encodings like UTF-7 that don't fit that scheme.
It's really better to determine the encoding first (maybe with Encode::Guess, a core module) and then properly decode it with Encode::decode.
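Since the thread points at Encode::Guess and Encode::decode without showing usage, here's a minimal sketch; the sample byte string is my own, not from the thread:

```perl
use strict;
use warnings;
use Encode::Guess;   # ships with Encode (core since 5.8); exports guess_encoding

# Hypothetical input: the bytes of "café" encoded as UTF-8.
my $octets = "caf\xc3\xa9";

# With no suspect list, guess_encoding tries ascii, utf8, and
# UTF-16/32 with a BOM. On failure or ambiguity it returns an
# error string instead of an Encode object, so check ref().
my $enc = guess_encoding($octets);
ref $enc or die "could not guess encoding: $enc\n";

my $text = $enc->decode($octets);   # now a proper character string
printf "guessed %s, %d characters\n", $enc->name, length $text;
```

One gotcha worth knowing: passing a suspect list like qw(latin1 utf8) is often ambiguous, because latin1 can decode any byte sequence; keep the candidates distinguishable.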
| [reply] |
thank you for the tip about these modules. Jim
| [reply] |
Here's a simple one-liner for checking the distribution of byte values in any given data stream or (set of) file(s) -- I'm using quoting that assumes a bash shell:
perl -ne '$c[$_]++ for (unpack("C*"));
END{printf( "%10d %02x\n",$c[$_], $_ ) for (0..255)}'
You can either prefix that with cat * | (where * would match one or more files of interest), or append one or more file names of interest after the close quote. As indicated in the END block, the output will be a list of 256 lines, with two tokens per line:
(# of bytes) (byte value)
where "byte value" (2nd column) ranges from 00 to ff, and the first column tells you how often the given byte value occurs in the data. If it's really 7-bit ascii text, then all the byte values from "80" through "ff" will have zeros in front of them.
With a little practice on different types of files, it's easy to notice patterns that distinguish various types of data -- e.g. UTF-16 with lots of characters in the 0000-00FF range is easy to spot due to having about half the data showing up as null bytes (00); UTF-8 will have various patterns depending on the language of the text, but something the alphabetic languages (Latin, Cyrillic, Greek, Arabic) have in common is one or two byte values in the c0-ff range showing up a lot, plus a similar quantity of values spread out in the 80-bf range.
Single-byte encodings (cp125*, iso-8859-*) are likewise distinctive -- they all have a sparse scattering in the a0-ff range (except Arabic, which is mostly in that range); but cp125* uses 80-9f as well, where iso-8859-* does not. You can also see quickly whether there are carriage returns in the data (0d), and if so, whether they match the quantity of line feeds (0a). If the data is supposed to be a tab-delimited table, you can check whether the number of tabs (09) divides evenly into the number of line feeds, and so on.
If you're going to use this sort of diagnostic a lot (I certainly do), it'll be worth while turning it into a general utility script so you can spruce it up a bit -- handle command-line options to allow printing as a 16x16 grid instead of 256 lines; optionally print summaries (how many bytes in the 80-ff range, how many in the a0-ff range, how many white-space, etc). | [reply] [d/l] [select] |
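The general utility script suggested above might start out something like this sketch (the function names, grid layout, and summary line are my own choices):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every byte value in a string; returns a 256-element list.
sub byte_counts {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack "C*", $data;
    return @count;
}

# Print the counts as a 16x16 grid plus a quick high-byte summary;
# row "c0:" covers byte values c0 through cf.
sub print_grid {
    my @count = @_;
    for my $row (0 .. 15) {
        printf "%x0:", $row;
        printf " %7d", $count[ $row * 16 + $_ ] for 0 .. 15;
        print "\n";
    }
    my $high = 0;
    $high += $count[$_] for 0x80 .. 0xff;
    printf "bytes in 80-ff range: %d\n", $high;
}

# Driver: read each file named on the command line as raw bytes.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!";
    my $data = do { local $/; <$fh> };
    print "== $file ==\n";
    print_grid( byte_counts($data) );
}
```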
For something on the order of 100 MB that's a lot of work, and as simple as the task is, I'd just write it in C.
But if you want to keep it in Perl, there's one bug and a few optimizations that come to mind:
- You have to chomp the lines first or CR/LF characters will always fall in the "bad character" range.
- foreach(split //) is a lot faster than regexing yourself through single characters
- If you expect bad characters to be relatively rare, checking your line first with something like /[\x1-\x20\x7f-\xff]/ to see whether it even makes sense to go through the line character by character would speed up things enormously.
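Putting those three points together, the loop might look like the following sketch. Two caveats: the sample lines are made up, and I've ended the control range at \x1f rather than \x20 so that ordinary spaces aren't flagged as bad:

```perl
use strict;
use warnings;

# Sample lines -- stand-ins for whatever the real input is.
my @lines = ("plain ascii line", "tab\there", "high byte: \xe9");

my %bad;    # hex byte value => occurrence count
for my $line (@lines) {
    chomp $line;    # matters when reading from a file with line endings
    # Cheap whole-line test first; most clean lines stop here.
    next unless $line =~ /[\x01-\x1f\x7f-\xff]/;
    # Slow path: walk the rare offending line character by character.
    for my $ch (split //, $line) {
        $bad{ sprintf "%02x", ord $ch }++
            if $ch =~ /[\x01-\x1f\x7f-\xff]/;
    }
}
printf "byte %s occurs %d time(s)\n", $_, $bad{$_} for sort keys %bad;
```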
However, I think you're right that the whole task needs to get clearer. You say it's unknown what the encoding is supposed to be, but are you sure you're dealing with an 8-bit character set? As you wrote it, it would probably work for ASCII but not much else -- anything from the Latin-x family (and many other charsets) may contain characters > 126.
The "ISO 8859 Alphabet Soup" might help visualizing what you want to check for: czyborra.com/charsets/iso8859.html
Edit: fixed character range typo as per jimw54321's comment
| [reply] [d/l] [select] |
/[\x1-\x20\x80-\xff]/
I checked with my dba. He believes that the incoming data is supposed to be 7-bit ascii.
The tip about the webpage is especially helpful. I happened to see some "A0", which apparently only applies to "CP1252 WinLatin1".
thanks again. | [reply] [d/l] |
Well, if this is really supposed to be 7-bit ASCII, then you are well on your way! There are only a maximum of 128 possibilities. Not sure if you have 100 Mb (megabits) or 100 MB (megabytes).
If performance becomes an issue, then one thing to try is sysread(), which will read each hunk of bytes into a single $char_string. Then use substr() to look at each byte.
split(//) is slow because it has to create an array; substr() is faster because that won't happen -- use the form that returns just the current single byte.
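A sketch of the sysread/substr combination described above; the buffer size and the commented-out filename are my own choices:

```perl
use strict;
use warnings;

# Count bytes outside the 32..126 printable-ASCII range using substr,
# which avoids building the temporary list that split // would create.
sub count_non_ascii {
    my ($buf) = @_;
    my $bad = 0;
    for my $i ( 0 .. length($buf) - 1 ) {
        my $o = ord substr $buf, $i, 1;   # one byte at a time
        $bad++ if $o < 32 || $o > 126;    # note: counts CR/LF/tab too
    }
    return $bad;
}

# Typical driver, reading 1 MB hunks (filename is hypothetical):
# open my $fh, '<:raw', 'data.txt' or die $!;
# my ($total, $chunk) = (0, '');
# $total += count_non_ascii($chunk) while sysread $fh, $chunk, 1 << 20;
# print "$total suspect bytes\n";
```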
However, it sounds like the main idea is just to get an answer. If it takes 20 minutes, nobody is going to care!
| [reply] |
You're welcome! I just noticed <code> doesn't render correctly in a list; I should have proofread this properly. I actually meant \x7f instead of \x79 -- off the top of my head I'd have used \x80 as the start of the invalid "high-ASCII" range, but as 0x7f (DEL) is a control character like the ones below \x20, it makes sense to include it as you did in the OP.
| [reply] |