Masters of the dromedary,

My current project is to extract data from a proprietary format to MySQL. I use the database vendor's tool to dump the files to normal ASCII, and then I process them.

I recently got my hands on the spec for the proprietary format. I now have the knowledge to "decode" the proprietary format without using their tool to dump the file to ASCII (we're talking 27-30 gig files here, so disk usage is a big concern with this method).

My question involves iterating over string contents. The "compression" algorithm is incredibly simplistic, but effective. It uses run-length encoding for blank spaces (0xFF byte followed by ASCII byte value equaling length), and turns consecutive digits into the non-printable ASCII values. For example...

Now I can get to my question. Running speed is of the utmost importance here. I know that perl could never do this as fast as the proprietary C utility that I use to dump these 30 gig files. But if I can avoid creating temp files and read them natively in perl, I can avoid disk usage issues.

What is the most efficient way to translate those ASCII bytes in perl? Perl's smallest unit for character data, IIRC, is the string. I need to be able to translate, as per the table above, any ASCII 0x80 into "00" in place in the string, ASCII 0x81 into "01", and so on.

I guess I could do s///, but regexes would probably be ridiculously slow. Or I could use index once for each type of replacement character, in combination with substr. But that would mean running index 99 times (or more if there's more than one instance of the character) on over four million records @_@
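To make the s/// idea concrete, here is a minimal sketch of a single-pass version that uses a precomputed lookup table, so each record is scanned once rather than once per replacement character. It rests on assumptions that are not confirmed by anything above: that bytes 0x80 through 0xE3 stand for the pairs "00" through "99", and that the byte after 0xFF is the raw count of spaces. The names @PAIR and expand_record are made up for illustration. tr/// is not an option here, since it maps one character to one character and these expansions change the length.

```perl
use strict;
use warnings;

# Assumption: bytes 0x80..0xE3 encode the digit pairs "00".."99".
my @PAIR = map { sprintf '%02d', $_ } 0 .. 99;

# Hypothetical helper: expand one compressed record in a single pass.
sub expand_record {
    my ($rec) = @_;
    $rec =~ s{
        \xFF (.)          # run-length marker, then a length byte (assumed raw count)
      | ([\x80-\xE3])     # packed digit pair
    }{
        defined $1 ? ' ' x ord($1) : $PAIR[ ord($2) - 0x80 ]
    }gesx;
    return $rec;
}
```

Handling both cases in one alternation matters: the length byte after 0xFF is consumed together with the marker, so a length that happens to fall in the 0x80-0xE3 range is not mistaken for a digit pair. Whether this is fast enough on 30 gig files is something only a benchmark on real records can say.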

I got my start coding in perl. So I am used to dealing with data in strings, not arrays of bytes. If anyone can help point me in the right direction for coding this up in the most efficient way possible, I'd be very grateful.

--
perl: code of the samurai


In reply to Translating non-printable ascii by samurai
