Re: unknown encoding

For something on the order of 100 MB that's a lot of work, and as simple as the task is I'd just write it in C. But if you want to keep it in Perl, there's one bug and a few optimizations that come to mind:

You have to chomp the lines first or CR/LF characters will always fall in the "bad character" range.
foreach(split //) is a lot faster than regexing yourself through single characters.
If you expect bad characters to be relatively rare, checking your line first with something like `/[\x1-\x20\x7f-\xff]/` to see whether it even makes sense to go through the line character by character would speed things up enormously.
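A minimal sketch of how these three tips might fit together, assuming the file is read as raw bytes. The helper name `bad_chars` and the `$BAD` pattern variable are made up for illustration, and the range is simply the one quoted above (note that it also flags the space character, \x20). Since chomp only removes the trailing newline, the sketch strips an optional CR as well.

```perl
use strict;
use warnings;

# Range taken from the post above; adjust it to whatever counts as "bad"
# for your data (as written, it also flags the space character, \x20).
my $BAD = qr/[\x01-\x20\x7f-\xff]/;

# Hypothetical helper: returns a list of [offset, byte value] pairs for
# every character of $line that falls in the "bad" range.
sub bad_chars {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # chomp alone would leave the CR of a CRLF ending
    # Cheap pre-check: skip the per-character loop for clean lines.
    return unless $line =~ $BAD;
    my @bad;
    my $pos = 0;
    foreach my $c (split //, $line) {
        push @bad, [ $pos, ord $c ] if $c =~ $BAD;
        $pos++;
    }
    return @bad;
}

# Usage sketch: pass a file name on the command line.
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "open $ARGV[0]: $!";
    while ( my $line = <$fh> ) {
        printf "line %d, offset %d: 0x%02x\n", $., @$_ for bad_chars($line);
    }
    close $fh;
}
```

The pre-check only pays off if most lines are clean; if nearly every line contains a flagged character, it just adds one more regex match per line.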

However, I think you're right that the whole task needs to be clearer. You say it's unknown what the encoding is supposed to be, but are you sure you're dealing with an 8-bit character set? As you wrote it, it would probably work for ASCII but not much else: anything from the Latin-x family (and many other charsets) may contain characters >126. The "ISO 8859 Alphabet Soup" might help you visualize what you want to check for: czyborra.com/charsets/iso8859.html

Edit: fixed character range typo as per jimw54321's comment

Re^2: unknown encoding by jimw54321 (Acolyte) on Oct 31, 2011 at 17:19 UTC
Great tips, thanks. BTW, I assume you meant: `/[\x1-\x20\x80-\xff]/` I checked with my DBA. He believes that the incoming data is supposed to be 7-bit ASCII. The tip about the webpage is especially helpful. I happened to see some "A0", which apparently only applies to "CP1252 WinLatin1". Thanks again.
Re^3: unknown encoding by Marshall (Canon) on Oct 31, 2011 at 18:28 UTC
Well, if this is really supposed to be 7-bit ASCII, then you are well on your way! There are only 128 possibilities at most. Not sure if you have 100 Mb or 100 MB. If performance becomes an issue, then one thing to try is sysread(), which will get each hunk of bytes into a single $char_string. Then use substr() to look at each byte. split(//) is slow because it has to create an array. substr() is faster because that won't happen; use the form that returns just the current single byte. However, it sounds like the main idea is to just get an answer. If it takes 20 minutes, nobody is going to care!
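A rough sketch of the sysread() plus substr() idea, assuming the goal is simply to tally which bytes outside 7-bit ASCII occur. The function name `count_high_bytes` and the 64 KB default chunk size are made up for illustration and not tuned; the per-chunk regex pre-check is borrowed from the earlier tip.

```perl
use strict;
use warnings;

# Hypothetical helper: reads $fh in fixed-size chunks with sysread() and
# returns a hash ref mapping each byte value >= 0x80 to its count.
sub count_high_bytes {
    my ( $fh, $chunk_size ) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default
    my %count;
    my $buf;
    while (1) {
        my $n = sysread( $fh, $buf, $chunk_size );
        die "sysread failed: $!" unless defined $n;
        last if $n == 0;          # end of file
        # Skip chunks that are pure 7-bit ASCII.
        next unless $buf =~ /[\x80-\xff]/;
        for my $i ( 0 .. $n - 1 ) {
            # Three-argument substr returns just the single byte at $i.
            my $byte = ord substr( $buf, $i, 1 );
            $count{$byte}++ if $byte >= 0x80;
        }
    }
    return \%count;
}
```

Something like `open my $fh, '<:raw', $file or die $!; my $c = count_high_bytes($fh); printf "0x%02x: %d\n", $_, $c->{$_} for sort { $a <=> $b } keys %$c;` would then print a small histogram, which can be compared against a charset table.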
Re^4: unknown encoding by jimw54321 (Acolyte) on Oct 31, 2011 at 19:07 UTC
Hi Marshall,
My confusion began when I looked at "perldoc perluniintro" and "perldoc perlunicode". It sounds like values > 255 get wrapped around if ASCII encoding is wrongly assumed. If anyone can straighten me out, that would be appreciated. I should have included that in the original post.
The response from earlier led me to a webpage about various encodings. From that, I see that some data entry from the other organization may accidentally have set their encoding to "CP1252 -- WinLatin1". I happened to see "A0", which seems to only apply to that encoding.
When I get a chance, I will try out the substr and sysread approaches.
Thanks, Jim
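One way to check what a given byte means under different candidate encodings is to decode it with the core Encode module and look at the resulting code point. This is only a sketch; the helper name `code_points_for` is made up for illustration. It assumes the data was read as raw bytes, in which case every value is in the range 0..255 and nothing can wrap around.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a single byte under each named encoding and
# return a hash ref of encoding name => "U+XXXX" code point string.
sub code_points_for {
    my ( $byte, @encodings ) = @_;
    my %cp;
    for my $enc (@encodings) {
        my $char = decode( $enc, chr $byte );
        $cp{$enc} = sprintf 'U+%04X', ord $char;
    }
    return \%cp;
}

# Byte 0xA0 decodes to U+00A0 (no-break space) under both encodings, while
# 0x80 is U+0080 in ISO-8859-1 but U+20AC (euro sign) in CP1252.
for my $byte ( 0xA0, 0x80 ) {
    my $cp = code_points_for( $byte, 'iso-8859-1', 'cp1252' );
    printf "0x%02X: %s => %s\n", $byte, $_, $cp->{$_} for sort keys %$cp;
}
```

So seeing 0xA0 on its own does not distinguish CP1252 from ISO-8859-1; bytes in the 0x80 to 0x9F range are more telling.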
Re^5: unknown encoding by Marshall (Canon) on Oct 31, 2011 at 19:51 UTC
Re^6: unknown encoding by Lotus1 (Vicar) on Oct 31, 2011 at 20:30 UTC
Some notes below your chosen depth have not been shown here
Re^3: unknown encoding by mbethke (Hermit) on Oct 31, 2011 at 18:19 UTC
You're welcome! I just noticed <code> doesn't render correctly in a list; I should have proofread this properly. I actually meant \x7f instead of \x79. Off the top of my head I'd have used \x80 as the start of invalid "high-ASCII", but as 0x7f is a control character like the ones below \x20, it makes sense to include it as you did in the OP.