At first, you said:
The "compression" algorithm ... uses run-length encoding for blank spaces (0xFF byte followed by ASCII byte value equaling length), and turns consecutive digits into the non-printable ASCII values... What is the most efficient way to translate those ASCII bytes in perl?

But then later you give this "example":

I need to turn (ascii characters in curly braces): MO{ASCII 0x81}B{ASCII 0x8D}CAJ{ASCII 0xA3} into: MO01B12CAJ32

The example doesn't show the 0xFF bytes that you say should precede the RLE count values (and if x81 is 01, then x8D should be 13 and xA3 should be 35), but I digress.

If the byte sequence xFFxYY (where "YY" is a byte value between x80 and xFF) is supposed to represent string of "blanks" (i.e. between 0 and 128 space characters), it sounds like the original (pre-RLE-compressed) data stream is just a fixed-width flat file, and the "xFFxYY" sequences are just field separators.

So consider the following questions:

  1. Does the input data contain line breaks (LF or CRLF) to separate the rows?
  2. Regardless of that, do you know how many fields make up each row? (I presume you do, since you're importing the data into a mysql table.)
  3. If you were to add up the "uncompressed" number of characters in each row (this would be the sum of the lengths of the "printable" fields plus the sum of the "non-printable" RLE counts for spaces), would you always get the same total width for each row?
I'm guessing that the answer to the third question is "yes", and that for each pair of "printable field value" and following "non-printable RLE count value", the total length of these two values will alway be the same for a given field.

That would mean that the RLE count is predictable from the number of characters in the preceding "printable" field. It also means that there is no need for you to retain the RLE counts. Just treat any sequence of two or more "non-printable" bytes as a field separator. Make sure that you can correctly determine the end of a "row", and push the field data into mysql; e.g. if the input data has "normal" line-breaks, you could handle it as follows:

while (<INPUT>) { s/[\x80-\xff]+/\t/g; # turn all field separators into tabs print OUTPUT; }
(If there are no line-breaks or other explicit markers of row boundaries, it's a little trickier to do it in an optimal way, but it's still quite doable.)

This assumes that the original data never contains a tab as part of a field value -- probably a safe assumption in fixed-width flat file data, but if tabs do appear as data, just use something else (maybe even a particular "non-printable" character like "\xB7" or "\xA0"). Having a single, consistent field-separator character makes it trivial to import the data into mysql. It also saves a fair number of bytes in the file that you use for loading into mysql.

You could of course use DBI to pump the data directly into mysql, but if you'll be doing this sort of data transfer a lot, you'll want to test how long it takes using DBI and no temp file, as opposed to feeding a temp file to mysqlimport (i.e. using the mysql-native "LOAD DATA INFILE" mechanism). In general, the latter goes a lot faster than running "insert" statements in DBI; even with the time it takes to generate the temp file, you could still come out ahead. (For that matter, it looks like mysql 4.1 and later will support feeding "mysqlimport" via a pipe, but I haven't tried this.)


In reply to Re: Translating non-printable ascii by graff
in thread Translating non-printable ascii by samurai

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.