Bottom line: will my approach of < 32 ascii or > 126 ascii work despite the actual encoding sent?
Not reliably. There are character encodings like UTF-7 that don't fit that scheme.
It's really better to determine the encoding first (maybe with Encode::Guess, a core module) and then properly decode it with Encode::decode.
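Since the thread points at Encode::Guess and Encode::decode without showing usage, here's a minimal sketch; the sample byte string is my own, not from the thread:

```perl
use strict;
use warnings;
use Encode::Guess;   # ships with Encode (core since 5.8); exports guess_encoding

# Hypothetical input: the bytes of "café" encoded as UTF-8.
my $octets = "caf\xc3\xa9";

# With no suspect list, guess_encoding tries ascii, utf8, and
# UTF-16/32 with a BOM. On failure or ambiguity it returns an
# error string instead of an Encode object, so check ref().
my $enc = guess_encoding($octets);
ref $enc or die "could not guess encoding: $enc\n";

my $text = $enc->decode($octets);   # now a proper character string
printf "guessed %s, %d characters\n", $enc->name, length $text;
```

One gotcha worth knowing: passing a suspect list like qw(latin1 utf8) is often ambiguous, because latin1 can decode any byte sequence; keep the candidates distinguishable.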
| [reply] |
thank you for the tip about these modules. Jim
| [reply] |
Here's a simple one-liner for checking the distribution of byte values in any given data stream or (set of) file(s) -- I'm using quoting that assumes a bash shell:
perl -ne '$c[$_]++ for (unpack("C*"));
END{printf( "%10d %02x\n",$c[$_], $_ ) for (0..255)}'
You can either prefix that with cat * | (where * would match one or more files of interest), or append one or more file names of interest after the close quote. As indicated in the END block, the output will be a list of 256 lines, with two tokens per line:
(# of bytes) (byte value)
where "byte value" (2nd column) ranges from 00 to ff, and the first column tells you how often the given byte value occurs in the data. If it's really 7-bit ascii text, then all the byte values from "80" through "ff" will have zeros in front of them.
With a little practice on different types of files, it's easy to notice patterns that distinguish various types of data -- e.g. UTF-16 with lots of characters in the 0000-00FF range is easy to spot due to having about half the data showing up as null bytes (00); UTF-8 will have various patterns depending on the language of the text, but something the alphabetic languages (Latin, Cyrillic, Greek, Arabic) have in common is one or two byte values in the c0-ff range showing up a lot, plus a similar quantity of values spread out in the 80-bf range.
Single-byte encodings (cp125*, iso-8859-*) are likewise distinctive -- they all have a sparse scattering in the a0-ff range (except Arabic, which is mostly in that range); but cp125* uses 80-9f as well, where iso-8859-* does not. You can also see quickly whether there are carriage returns in the data (0d), and if so, whether they match the quantity of line feeds (0a). If the data is supposed to be a tab-delimited table, you can check whether the number of tabs (09) divides evenly into the number of line feeds, and so on.
If you're going to use this sort of diagnostic a lot (I certainly do), it'll be worth while turning it into a general utility script so you can spruce it up a bit -- handle command-line options to allow printing as a 16x16 grid instead of 256 lines; optionally print summaries (how many bytes in the 80-ff range, how many in the a0-ff range, how many white-space, etc). | [reply] [d/l] [select] |
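The general utility script suggested above might start out something like this sketch (the function names, grid layout, and summary line are my own choices):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count every byte value in a string; returns a 256-element list.
sub byte_counts {
    my ($data) = @_;
    my @count = (0) x 256;
    $count[$_]++ for unpack "C*", $data;
    return @count;
}

# Print the counts as a 16x16 grid plus a quick high-byte summary;
# row "c0:" covers byte values c0 through cf.
sub print_grid {
    my @count = @_;
    for my $row (0 .. 15) {
        printf "%x0:", $row;
        printf " %7d", $count[ $row * 16 + $_ ] for 0 .. 15;
        print "\n";
    }
    my $high = 0;
    $high += $count[$_] for 0x80 .. 0xff;
    printf "bytes in 80-ff range: %d\n", $high;
}

# Driver: read each file named on the command line as raw bytes.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "$file: $!";
    my $data = do { local $/; <$fh> };
    print "== $file ==\n";
    print_grid( byte_counts($data) );
}
```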
For something on the order of 100 MB that's a lot of work, and as simple as the task is, I'd just write it in C.
But if you want to keep it in Perl, there's one bug and a few optimizations that come to mind:
- You have to chomp the lines first or CR/LF characters will always fall in the "bad character" range.
- foreach(split //) is a lot faster than regexing yourself through single characters
- If you expect bad characters to be relatively rare, checking your line first with something like /[\x1-\x20\x7f-\xff]/ to see whether it even makes sense to go through the line character by character would speed up things enormously.
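Putting those three points together, the loop might look like the following sketch. Two caveats: the sample lines are made up, and I've ended the control range at \x1f rather than \x20 so that ordinary spaces aren't flagged as bad:

```perl
use strict;
use warnings;

# Sample lines -- stand-ins for whatever the real input is.
my @lines = ("plain ascii line", "tab\there", "high byte: \xe9");

my %bad;    # hex byte value => occurrence count
for my $line (@lines) {
    chomp $line;    # matters when reading from a file with line endings
    # Cheap whole-line test first; most clean lines stop here.
    next unless $line =~ /[\x01-\x1f\x7f-\xff]/;
    # Slow path: walk the rare offending line character by character.
    for my $ch (split //, $line) {
        $bad{ sprintf "%02x", ord $ch }++
            if $ch =~ /[\x01-\x1f\x7f-\xff]/;
    }
}
printf "byte %s occurs %d time(s)\n", $_, $bad{$_} for sort keys %bad;
```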
However, I think you're right that the whole task needs to get clearer. You say it's unknown what the encoding is supposed to be, but are you sure you're dealing with an 8-bit character set? As you wrote it, it would probably work for ASCII but not much else -- anything from the Latin-x family (and many other charsets) may contain characters > 126.
The "ISO 8859 Alphabet Soup" might help visualizing what you want to check for: czyborra.com/charsets/iso8859.html
Edit: fixed character range typo as per jimw54321's comment
| [reply] [d/l] [select] |
/[\x1-\x20\x80-\xff]/
I checked with my dba. He believes that the incoming data is supposed to be 7-bit ascii.
The tip about the webpage is especially helpful. I happened to see some "A0", which apparently only applies to "CP1252 WinLatin1".
thanks again. | [reply] [d/l] |
Well, if this is really supposed to be 7-bit ASCII, then you are well on your way! There are only a maximum of 128 possibilities. Not sure if you have 100 Mb (megabits) or 100 MB (megabytes).
If performance becomes an issue, then one thing to try is sysread(), which will read each hunk of bytes into a single $char_string. Then use substr() to look at each byte.
split(//) is slow because it has to create an array; substr() is faster because that won't happen -- use the form that returns just the current single byte.
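A sketch of the sysread/substr combination described above; the buffer size and the commented-out filename are my own choices:

```perl
use strict;
use warnings;

# Count bytes outside the 32..126 printable-ASCII range using substr,
# which avoids building the temporary list that split // would create.
sub count_non_ascii {
    my ($buf) = @_;
    my $bad = 0;
    for my $i ( 0 .. length($buf) - 1 ) {
        my $o = ord substr $buf, $i, 1;   # one byte at a time
        $bad++ if $o < 32 || $o > 126;    # note: counts CR/LF/tab too
    }
    return $bad;
}

# Typical driver, reading 1 MB hunks (filename is hypothetical):
# open my $fh, '<:raw', 'data.txt' or die $!;
# my ($total, $chunk) = (0, '');
# $total += count_non_ascii($chunk) while sysread $fh, $chunk, 1 << 20;
# print "$total suspect bytes\n";
```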
However, it sounds like the main idea is just to get an answer. If it takes 20 minutes, nobody is going to care!
| [reply] |
You're welcome! I just noticed <code> doesn't render correctly in a list; I should have proofread this properly. I actually meant \x7f instead of \x79 -- off the top of my head I'd have used \x80 as the start of the invalid "high-ASCII" range, but as 0x7f (DEL) is a control character like the ones below \x20, it makes sense to include it as you did in the OP.
| [reply] |