in reply to Translating non-printable ascii
The "compression" algorithm ... uses run-length encoding for blank spaces (0xFF byte followed by ASCII byte value equaling length), and turns consecutive digits into the non-printable ASCII values... What is the most efficient way to translate those ASCII bytes in perl?
But then later you give this "example":
I need to turn (ascii characters in curly braces): MO{ASCII 0x81}B{ASCII 0x8D}CAJ{ASCII 0xA3} into: MO01B12CAJ32
The example doesn't show the 0xFF bytes that you say should precede the RLE count values (and if x81 decodes to 01, then x8D should give 13 rather than 12, and xA3 should give 35 rather than 32), but I digress.
If the byte sequence xFFxYY (where "YY" is a byte value between x80 and xFF) is supposed to represent a string of "blanks" (i.e. somewhere between 0 and 127 space characters, if the count byte carries the same x80 offset as the digit bytes), it sounds like the original (pre-RLE-compressed) data stream is just a fixed-width flat file, and the "xFFxYY" sequences are just field separators.
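If that reading is right, decoding is just a couple of regex substitutions. Here's a minimal sketch, assuming the digit bytes are offset by x80 (so x81 -> "01", x8D -> "13") and that the count byte after xFF carries the same offset -- both offsets are guesses based on your example, not confirmed:

    my $line = "MO\x81B\x8DCAJ\xA3";
    # expand each xFF + count byte into that many spaces
    # (no xFF in this sample, but this handles the blank runs)
    $line =~ s/\xFF([\x80-\xFF])/' ' x (ord($1) - 0x80)/ge;
    # turn each remaining high byte into its two-digit value
    $line =~ s/([\x80-\xFE])/sprintf '%02d', ord($1) - 0x80/ge;
    print $line, "\n";   # prints "MO01B13CAJ35"

If the real offsets turn out to be different, only the two "0x80" constants need to change.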
So consider the following:
That would mean that the RLE count is predictable from the number of characters in the preceding "printable" field. It also means that there is no need for you to retain the RLE counts. Just treat any sequence of two or more "non-printable" bytes as a field separator. Make sure that you can correctly determine the end of a "row", and push the field data into mysql; e.g. if the input data has "normal" line-breaks, you could handle it as follows:
    while (<INPUT>) {
        s/[\x80-\xff]+/\t/g;   # turn all field separators into tabs
        print OUTPUT;
    }

(If there are no line-breaks or other explicit markers of row boundaries, it's a little trickier to do it in an optimal way, but it's still quite doable.)
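For what it's worth, here is the same filter fleshed out into a self-contained script with lexical filehandles; the file names are placeholders:

    use strict;
    use warnings;

    open my $in,  '<', 'compressed.dat' or die "can't read compressed.dat: $!";
    open my $out, '>', 'decoded.tsv'    or die "can't write decoded.tsv: $!";

    while (<$in>) {
        s/[\x80-\xff]+/\t/g;   # collapse each run of high bytes into one tab
        print {$out} $_;
    }

    close $out or die "close failed: $!";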
This assumes that the original data never contains a tab as part of a field value -- probably a safe assumption in fixed-width flat file data, but if tabs do appear as data, just use something else (maybe even a particular "non-printable" character like "\xB7" or "\xA0"). Having a single, consistent field-separator character makes it trivial to import the data into mysql. It also saves a fair number of bytes in the file that you use for loading into mysql.
You could of course use DBI to pump the data directly into mysql, but if you'll be doing this sort of data transfer a lot, you'll want to test how long it takes using DBI and no temp file, as opposed to feeding a temp file to mysqlimport (i.e. using the mysql-native "LOAD DATA INFILE" mechanism). In general, the latter goes a lot faster than running "insert" statements in DBI; even with the time it takes to generate the temp file, you could still come out ahead. (For that matter, it looks like mysql 4.1 and later will support feeding "mysqlimport" via a pipe, but I haven't tried this.)
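If you do go the temp-file route, the load step can still be driven from DBI with a single statement. A sketch, with made-up database, table, and file names (note that LOCAL has to be allowed on the client side; with DBD::mysql you can request it via mysql_local_infile in the DSN):

    use DBI;

    my $dbh = DBI->connect(
        'dbi:mysql:database=mydb;mysql_local_infile=1',
        'user', 'password',
        { RaiseError => 1 },
    );

    # one statement replaces a long stream of individual inserts
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/tmp/decoded.tsv'
        INTO TABLE my_table
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\n'
    });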