
great tips. thanks. btw, I assume you meant:

/[\x1-\x20\x80-\xff]/

I checked with my DBA, who believes that the incoming data is supposed to be 7-bit ASCII.

The tip about the webpage is especially helpful. I happened to see some "A0", which apparently only applies to "CP1252 WinLatin1".
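For anyone who wants to see which unexpected bytes actually occur, here is a minimal sketch that tallies them. The character class is my own assumption (it treats printable ASCII plus tab, CR and LF as acceptable), and the name tally_suspect_bytes and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Tally every byte outside printable 7-bit ASCII (tab, CR and LF are allowed).
# Returns a hash ref: two-digit hex byte value => number of occurrences.
sub tally_suspect_bytes {
    my ($data) = @_;
    my %count;
    $count{ sprintf '%02X', ord $1 }++ while $data =~ /([^\x20-\x7e\t\r\n])/g;
    return \%count;
}

# Hypothetical sample: plain ASCII plus two 0xA0 bytes and one NUL byte.
my $sample = "name\xA0one\nname\xA0two\x00\n";
my $tally  = tally_suspect_bytes($sample);
printf "0x%s: %d\n", $_, $tally->{$_} for sort keys %$tally;
```

A short list of offending byte values like this is usually enough to guess the encoding from a code page table.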

thanks again.

Re^3: unknown encoding
by Marshall (Canon) on Oct 31, 2011 at 18:28 UTC
    Well, if this is really supposed to be 7-bit ASCII, then you are well on your way! There are at most 128 possible values. Not sure if you have 100 Mb or 100 MB.

    If performance becomes an issue, then one thing to try is sysread() which will get each hunk of bytes into a single $char_string. Then use substr() to look at each byte.

    split(//) is slow because it has to create an array. substr() is faster because no array is built; use the three-argument form with a length of 1, which returns just the current byte.
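    A rough sketch of the sysread() plus substr() idea, not a tuned implementation. The sub name count_high_bytes and the 64 KB default chunk size are assumptions chosen for illustration; the file name comes from the command line.

```perl
use strict;
use warnings;

# Read a file in fixed-size chunks with sysread() and look at each byte
# with substr(), counting the bytes that have the high bit set (> 0x7f).
sub count_high_bytes {
    my ($path, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;    # assumed default, adjust to taste
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my ($high, $buf) = (0, '');
    while (1) {
        my $got = sysread $fh, $buf, $chunk_size;
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;        # end of file
        for my $i (0 .. $got - 1) {
            # Three-argument substr() returns one byte; no array is built.
            $high++ if ord(substr $buf, $i, 1) > 0x7f;
        }
    }
    close $fh;
    return $high;
}

# Only runs when a file name is given on the command line.
print count_high_bytes($ARGV[0]), "\n" if @ARGV;
```

    Whether this beats split(//) by enough to matter is something to measure on the real data, for example with the core Benchmark module.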

    However, it sounds like the main idea is just to get an answer. If it takes 20 minutes, nobody is going to care!

      Hi Marshall

      My confusion began when I looked at "perldoc perluniintro" and "perldoc perlunicode". It sounds like values > 255 get wrapped around if ascii encoding is wrongly assumed. If anyone can straighten me out, that is appreciated. Should have included that in the original post.

      The response from earlier led me to a webpage about various encodings. From that, I see that someone doing data entry at the other organization may accidentally have set their encoding to "CP1252 -- WinLatin1". I happened to see "A0", which seems to apply only to that encoding.

      When I get a chance, I will try out the substr and sysread approaches.

      Thanks, Jim
        Perl treats data as single bytes (ASCII-style) unless otherwise specified. That means a one-to-one mapping: one byte => one character.

        Perl can deal with any character encoding that I know of. But you have to tell it what character encoding is expected, UTF-8, UTF-16 or whatever.
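        As a sketch of what "telling Perl the encoding" can look like, the core Encode module decodes raw bytes into characters, and an :encoding() layer does the same while reading a file. The sub names bytes_to_chars and open_as are made up for illustration, and using cp1252 here is only an assumption based on the guess made earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw bytes into Perl characters using an explicitly named encoding.
sub bytes_to_chars {
    my ($encoding, $bytes) = @_;
    return decode($encoding, $bytes);
}

# Byte 0x93 is a curly opening quote in CP1252: one byte in, one character out.
my $chars = bytes_to_chars('cp1252', "\x93");
printf "U+%04X\n", ord $chars;    # prints U+201C

# The same decoding can be requested when opening a file.
# The path is supplied by the caller; nothing is opened in this sketch.
sub open_as {
    my ($encoding, $path) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    return $fh;
}
```

        With no layer and no decode() call, each byte simply arrives as one character with a value from 0 to 255.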

        My comment about substr() concerned how to optimize the processing to make things run faster when dealing with typical ASCII 7 bit or even 8 bit encoding. I didn't mean to confuse. It sounds like your friend just wants an answer.

        Let's worry about how to make things run faster if and when that is necessary. Right now, I think that is just an academic exercise, but I'd help you with that if you want.

        For many problems, being optimally efficient is just not necessary! The substr() idea works much like the way the processing would be done in 'C' and will be faster than the split(//), but at the "expense" of more programming effort.

        Perl is a language that can solve problems quickly (in terms of coding efficiency). And it has many features that allow it to run very close to say a 'C' program in terms of performance.

        How fast is fast enough? Well that depends. I have one app that takes 4+ hours on my machine to run. I have another team member who can do it in 56 minutes. Another team member has a new machine on the way that can do it in <40 minutes. How fast is enough? Well, <one hour is "fast enough". 40 minutes vs 56 minutes won't make any difference in this app because it takes us hours+ to "get ready for the next run". 40 minutes vs 4 hours makes a difference because we could get ready for a new run in 2 hours and get two runs done in a day.

        Programming involves trade-offs between how long it takes you to write the code vs how long the code runs, and a whole bunch of other factors. Sometimes slower code is better because it is easier to understand and maintain.

        In general, from my experience, the thing to optimize is your ability to write clear, maintainable code. Usually but not always, clear code is fast code, assuming that this "clear code" uses an efficient algorithm. Deciding upon the algorithm is the most important part of writing clear, fast code.

        Hope that this very long post was understandable to you.

        PS: I am working on a new version of this app and it will run in like 20 minutes on my machine (although I have only promised a x4 speed increase) - another programming trick, promise less than you think that you can do (based upon benchmarks)! I am adding a lot of features and this requires hundreds of hours of work. The complexity is x10. If I get all the new features in there and it runs within an hour on my machine, everybody is going to be happy.

Re^3: unknown encoding
by mbethke (Hermit) on Oct 31, 2011 at 18:19 UTC

    You're welcome! I just noticed <code> doesn't render correctly in a list, should have properly proofread this.

    I actually meant \x7f instead of \x79. Off the top of my head I'd have used \x80 as the start of invalid "high-ASCII", but as 0x7f is a control character like the ones below \x20, it makes sense to include it as you did in the OP.