in reply to Translating non-printable ascii
The "compression" algorithm ... uses run-length encoding for blank spaces (0xFF byte followed by ASCII byte value equaling length), and turns consecutive digits into the non-printable ASCII values... What is the most efficient way to translate those ASCII bytes in perl?
But then later you give this "example":
I need to turn (ascii characters in curly braces): MO{ASCII 0x81}B{ASCII 0x8D}CAJ{ASCII 0xA3} into: MO01B12CAJ32
The example doesn't show the 0xFF bytes that you say should precede the RLE count values (and if x81 decodes to 01, then x8D should give 13 rather than 12, and xA3 should give 35 rather than 32), but I digress.
If the byte sequence xFFxYY (where "YY" is a byte value between x80 and xFF) is supposed to represent a string of "blanks" (i.e. somewhere between 0 and 127 space characters, if the count byte carries the same x80 offset as the digit bytes), it sounds like the original (pre-RLE-compressed) data stream is just a fixed-width flat file, and the "xFFxYY" sequences are just field separators.
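If that reading is right, decoding is just a couple of regex substitutions. Here's a minimal sketch, assuming the digit bytes are offset by x80 (so x81 -> "01", x8D -> "13") and that the count byte after xFF carries the same offset -- both offsets are guesses based on your example, not confirmed:

    my $line = "MO\x81B\x8DCAJ\xA3";
    # expand each xFF + count byte into that many spaces
    # (no xFF in this sample, but this handles the blank runs)
    $line =~ s/\xFF([\x80-\xFF])/' ' x (ord($1) - 0x80)/ge;
    # turn each remaining high byte into its two-digit value
    $line =~ s/([\x80-\xFE])/sprintf '%02d', ord($1) - 0x80/ge;
    print $line, "\n";   # prints "MO01B13CAJ35"

If the real offsets turn out to be different, only the two "0x80" constants need to change.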
So consider the following:
That would mean that the RLE count is predictable from the number of characters in the preceding "printable" field. It also means that there is no need for you to retain the RLE counts. Just treat any sequence of two or more "non-printable" bytes as a field separator. Make sure that you can correctly determine the end of a "row", and push the field data into mysql; e.g. if the input data has "normal" line-breaks, you could handle it as follows:
    while (<INPUT>) {
        s/[\x80-\xff]+/\t/g;   # turn all field separators into tabs
        print OUTPUT;
    }

(If there are no line-breaks or other explicit markers of row boundaries, it's a little trickier to do it in an optimal way, but it's still quite doable.)
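For what it's worth, here is the same filter fleshed out into a self-contained script with lexical filehandles; the file names are placeholders:

    use strict;
    use warnings;

    open my $in,  '<', 'compressed.dat' or die "can't read compressed.dat: $!";
    open my $out, '>', 'decoded.tsv'    or die "can't write decoded.tsv: $!";

    while (<$in>) {
        s/[\x80-\xff]+/\t/g;   # collapse each run of high bytes into one tab
        print {$out} $_;
    }

    close $out or die "close failed: $!";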
This assumes that the original data never contains a tab as part of a field value -- probably a safe assumption in fixed-width flat file data, but if tabs do appear as data, just use something else (maybe even a particular "non-printable" character like "\xB7" or "\xA0"). Having a single, consistent field-separator character makes it trivial to import the data into mysql. It also saves a fair number of bytes in the file that you use for loading into mysql.
You could of course use DBI to pump the data directly into mysql, but if you'll be doing this sort of data transfer a lot, you'll want to test how long it takes using DBI and no temp file, as opposed to feeding a temp file to mysqlimport (i.e. using the mysql-native "LOAD DATA INFILE" mechanism). In general, the latter goes a lot faster than running "insert" statements in DBI; even with the time it takes to generate the temp file, you could still come out ahead. (For that matter, it looks like mysql 4.1 and later will support feeding "mysqlimport" via a pipe, but I haven't tried this.)
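If you do go the temp-file route, the load step can still be driven from DBI with a single statement. A sketch, with made-up database, table, and file names (note that LOCAL has to be allowed on the client side; with DBD::mysql you can request it via mysql_local_infile in the DSN):

    use DBI;

    my $dbh = DBI->connect(
        'dbi:mysql:database=mydb;mysql_local_infile=1',
        'user', 'password',
        { RaiseError => 1 },
    );

    # one statement replaces a long stream of individual inserts
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/tmp/decoded.tsv'
        INTO TABLE my_table
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
    });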