Dear fellow monks,

I have to process two large CSV files (about 6 GB each). Getting the individual fields is no problem. In each record, one of the fields is a long string of alphanumeric characters, typically 150 to 300 of them (the number of characters is always a multiple of 5). I need to split that string into groups of five characters in order to then reorganize it. As far as I can tell, split is not appropriate for that, since there is no separator to split on. I used a regular expression, something like this:

my @sub_fields = $field16 =~ /\w{5}/g; # ...
But the process is very slow, and profiling the program shows that the line above takes far too much time. I intend to run some benchmarks to try to find something faster. Maybe a faster regex can be found (for example, /.{5}/g might be better). I will also try using the substr function in a loop to see if that goes faster, but I would be very happy if some kind monk could come up with another idea likely to bring higher performance.
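For reference, the substr loop I have in mind would look roughly like the sketch below. The sub name split_substr and the sample string are made up for illustration; in the real program the input would be $field16. I have not measured it yet, so I make no claim that it is faster than the regex.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a string into 5-character chunks with substr.
sub split_substr {
    my ($str) = @_;
    my @chunks;
    my $len = length $str;
    for (my $pos = 0; $pos < $len; $pos += 5) {
        push @chunks, substr($str, $pos, 5);
    }
    return @chunks;
}

# Made-up sample value standing in for $field16 (length is a multiple of 5).
my @sub_fields = split_substr('ABCDE12345fghij67890');
print "@sub_fields\n";    # ABCDE 12345 fghij 67890
```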

Another idea that I had was to use the unpack function, but I do not use it often and I am not sure how to use it to produce an array from variable-length strings. Presumably, the template should be something like "A5A5A5...". Is there any way of saying something like: "A5" repeated as many times as possible (until the end of the string)? Or do I have to use a different template for each possible string length?
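From what perlfunc says about pack, a template group in parentheses can take a repeat count, and * as the count means "as many times as there is data left". If I read that correctly, a single template should cover every string length. A minimal sketch with a made-up sample value (note that A strips trailing spaces and nulls from each chunk, which should not matter for purely alphanumeric data; a5 would keep them):

```perl
use strict;
use warnings;

# Made-up sample value standing in for $field16 (length is a multiple of 5).
my $field16 = 'ABCDE12345fghij67890';

# '(A5)*' repeats the A5 group until the string is exhausted.
my @sub_fields = unpack '(A5)*', $field16;

print join('|', @sub_fields), "\n";    # ABCDE|12345|fghij|67890
```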

I was also thinking of opening a filehandle on a reference to the string and using the read function in a loop to populate an array of chunks of five characters, but I doubt that opening a filehandle for each record of my input will really improve performance.
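To make the idea concrete, this is roughly what the in-memory filehandle version would look like. The sub name split_read and the sample string are invented for the example, and I expect the per-record open to cost more than it saves, so this is here mostly as a benchmark candidate.

```perl
use strict;
use warnings;

# Hypothetical helper: read 5-character chunks through an in-memory filehandle.
sub split_read {
    my ($str) = @_;
    open my $fh, '<', \$str
        or die "Cannot open in-memory filehandle: $!";
    my @chunks;
    my $chunk;
    # read returns the number of characters read, 0 at end of string.
    while (read $fh, $chunk, 5) {
        push @chunks, $chunk;
    }
    close $fh;
    return @chunks;
}

# Made-up sample value standing in for $field16.
my @sub_fields = split_read('ABCDE12345fghij67890');
print "@sub_fields\n";    # ABCDE 12345 fghij 67890
```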

Does anyone out there have a better idea for improving performance, so that I could include it in my benchmark?
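The benchmark itself would probably be built on the core Benchmark module, along the lines of the sketch below. The test string is made up (250 characters, i.e. 50 chunks), the candidate names are mine, and I am not quoting any timings because I have not run it against real data yet. Each candidate returns its chunk count so the results can be sanity-checked.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test string: 250 alphanumeric characters, i.e. 50 chunks of 5.
my $field16 = 'ABCDE12345' x 25;

my %candidates = (
    regex_w => sub { my @f = $field16 =~ /\w{5}/g;       scalar @f },
    regex_dot => sub { my @f = $field16 =~ /.{5}/g;      scalar @f },
    unpack_a5 => sub { my @f = unpack '(A5)*', $field16; scalar @f },
    substr_loop => sub {
        my @f;
        my $len = length $field16;
        for (my $pos = 0; $pos < $len; $pos += 5) {
            push @f, substr($field16, $pos, 5);
        }
        scalar @f;
    },
);

# A negative count asks Benchmark to run each candidate for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, \%candidates);
```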

Thank you for your help.

Update: I of course meant to say A5 for the unpack template, not A4 as I originally typed by mistake. Thanks to those who pointed out this typo. I corrected it above to be more consistent with the text.


In reply to Performance problems on splitting long strings by Laurent_R
