in reply to unknown encoding

For something on the order of 100 MB that's a lot of work, and as simple as the task is, I'd just write it in C. But if you want to keep it in Perl, there's one bug and a few optimizations that come to mind:

However, I think you're right that the whole task needs to be made clearer. You say it's unknown what the encoding is supposed to be, but are you sure you're dealing with an 8-bit character set? As you wrote it, it would probably work for ASCII but not much else: anything from the Latin-x family (and many other charsets) may contain characters >126. The "ISO 8859 Alphabet Soup" page might help you visualize what you want to check for: czyborra.com/charsets/iso8859.html

Edit: fixed character range typo as per jimw54321's comment

Re^2: unknown encoding
by jimw54321 (Acolyte) on Oct 31, 2011 at 17:19 UTC

    great tips. thanks. btw, I assume you meant:

    /[\x1-\x20\x80-\xff]/

    I checked with my DBA. He believes that the incoming data is supposed to be 7-bit ASCII.

    The tip about the webpage is especially helpful. I happened to see some "A0" bytes, which apparently only apply to "CP1252 WinLatin1".

    thanks again.

      Well, if this is really supposed to be 7-bit ASCII, then you are well on your way! There are only a maximum of 128 possibilities. Not sure if you have 100 Mb or 100 MB.

      If performance becomes an issue, then one thing to try is sysread() which will get each hunk of bytes into a single $char_string. Then use substr() to look at each byte.

      split(//) is slow because it has to create an array. substr() is faster because that won't happen - use the form that returns just the current single byte.
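
      A minimal sketch of the sysread()/substr() idea, in case it helps. The sub name count_bytes, the 64 KiB chunk size, and the choice to tally every byte value are my own illustrative assumptions, not something from the original script. It reads raw chunks, looks at one byte at a time with the 3-argument substr(), and then reports only the bytes above 0x7F:

      ```perl
      use strict;
      use warnings;

      # Tally every byte value in a file, reading it in fixed-size chunks.
      # count_bytes() and the 64 KiB chunk size are illustrative choices.
      sub count_bytes {
          my ($path) = @_;
          open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
          my @count = (0) x 256;
          my $buf;
          while (1) {
              my $n = sysread($fh, $buf, 65536);
              die "Read error on $path: $!" unless defined $n;
              last if $n == 0;    # end of file
              for my $i (0 .. $n - 1) {
                  # substr() returns a single byte here; no list is built
                  $count[ ord(substr($buf, $i, 1)) ]++;
              }
          }
          close($fh);
          return \@count;
      }

      # Report only the bytes that fall outside 7-bit ASCII.
      if (@ARGV) {
          my $count = count_bytes($ARGV[0]);
          for my $byte (0x80 .. 0xff) {
              printf "0x%02X: %d\n", $byte, $count->[$byte] if $count->[$byte];
          }
      }
      ```

      Run it with the data file as the only argument. A tally like this also shows at a glance which high bytes (such as A0) actually occur, which may help narrow down the encoding.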

      However, it sounds like the main idea is to just get an answer. If it takes 20 minutes, nobody is going to care!

        Hi Marshall

        My confusion began when I looked at "perldoc perluniintro" and "perldoc perlunicode". It sounds like values > 255 get wrapped around if ASCII encoding is wrongly assumed. If anyone can straighten me out, that would be appreciated. I should have included that in the original post.

        The response from earlier led me to a webpage about various encodings. From that, I see that whoever does data entry at the other organization may accidentally have set their encoding to "CP1252 -- WinLatin1". I happened to see "A0", which seems to only apply to that encoding.

        When I get a chance, I will try out the substr and sysread approaches.

        Thanks, Jim

      You're welcome! I just noticed <code> doesn't render correctly in a list; I should have proofread this properly.

      I actually meant \x7f instead of \x79. Off the top of my head I'd have used \x80 as the start of invalid "high-ASCII", but as 0x7f is a control character like the ones below \x20, it makes sense to include it as you did in the OP.
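
      For what it's worth, here is a rough sketch of a check along those lines. The exact character class is my assumption about what we both meant (control characters below \x20, DEL at \x7f, and everything with the high bit set), and suspect_bytes is just an illustrative name. Note that tab counts as a control character here, so drop \x09 from the class if tabs are legitimate in your data:

      ```perl
      use strict;
      use warnings;

      # Assumed class: controls below \x20, DEL (\x7f), and all high-bit bytes.
      my $suspect = qr/[\x00-\x1f\x7f-\xff]/;

      # Returns a list of [offset, byte value] pairs for one line.
      sub suspect_bytes {
          my ($line) = @_;
          my @hits;
          while ($line =~ /($suspect)/g) {
              # pos() points just past the matched byte
              push @hits, [ pos($line) - 1, ord($1) ];
          }
          return @hits;
      }

      if (@ARGV) {
          open(my $fh, '<:raw', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
          while (my $line = <$fh>) {
              $line =~ s/\r?\n\z//;    # line endings would otherwise match
              printf "line %d, offset %d: 0x%02X\n", $., @$_
                  for suspect_bytes($line);
          }
          close($fh);
      }
      ```

      Reading with the :raw layer keeps Perl from decoding anything, so the class is matched against plain bytes.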