So we're talking about 300,000 to 10,000,000 lines. Assuming 50 bytes per line on average, that's 15 to 500 MB. That's going to take a while to process regardless of how you code it; just reading 500 MB inevitably takes time.

Is Perl really the bottleneck here (CPU-bound), or is the code spending most of the time waiting for data from the disks (I/O-bound)? If you are I/O-bound, there isn't much you'll be able to do by trading split for anything else. Any linewise approach will need to wait for data just the same. Maybe reading and processing large chunks at once rather than working linewise could help by better exploiting buffers, but that too is unlikely to make a huge difference.
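One rough way to tell the two cases apart is to compare CPU time against wall-clock time for the same run. The following is only a sketch: profile_split is a made-up helper name, the separator and column index are whatever your data needs, and the split inside the loop stands in for your real processing.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a linewise split over $path and report
# wall-clock seconds versus CPU seconds (user + system) for the run.
sub profile_split {
    my ($path, $sep, $col) = @_;

    my $wall0 = time;              # high-resolution wall clock
    my ($user0, $sys0) = times;    # CPU time used by this process so far

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $lines = 0;
    while (my $line = <$fh>) {
        chomp $line;
        # Stand-in for the real work: pick one field per line.
        my $field = (split /\Q$sep\E/, $line)[$col];
        $lines++;
    }
    close $fh;

    my ($user1, $sys1) = times;
    return {
        lines => $lines,
        wall  => time - $wall0,
        cpu   => ($user1 - $user0) + ($sys1 - $sys0),
    };
}

# Example call (the file name is an assumption):
# my $r = profile_split('data.txt', ':', 1);
# printf "%d lines, wall %.2fs, cpu %.2fs\n", @{$r}{qw(lines wall cpu)};
```

If wall comes out several times larger than cpu, the process spent most of its time waiting, and swapping split for something else is unlikely to help much. Keep in mind that a second run over the same file may be served from the OS cache and look far more CPU-bound than the first.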

If you're on a platform that has an mmap(2) syscall, you might want to have a look at Sys::Mmap. mmap(2) maps the file's pages directly into your address space, so the data doesn't have to be copied from the kernel's buffers into a user-space buffer, and that can reduce the overhead of I/O considerably. A carefully constructed backtracking-free regex run against a string mmapped to a file may be noticeably faster than any approach doing explicit I/O. Or it may not. Benchmark is your friend.
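Here is a minimal sketch of what that could look like, assuming Sys::Mmap is installed from CPAN, the delimiter is a colon as in cut -d:, and Perl is 5.10 or later (for possessive quantifiers). mmap_column is a made-up name, and this is one possible shape for such a regex, not a tuned solution. Note that mapping an empty file will fail.

```perl
use strict;
use warnings;
use Sys::Mmap;    # CPAN module, not core; exports mmap, munmap, PROT_*, MAP_*

# Hypothetical helper: return an array ref holding field number $col
# (0-based, colon-delimited) of every line in $path that has that field.
sub mmap_column {
    my ($path, $col) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $map;
    # Length 0 asks Sys::Mmap to map the whole file.
    mmap($map, 0, PROT_READ, MAP_SHARED, $fh) or die "mmap $path: $!";

    # Skip $col fields, then capture the next one. The possessive *+
    # never gives characters back, so a short line fails immediately
    # instead of backtracking.
    my $skip = '[^:\n]*+:' x $col;
    my $re   = qr/^(?!\z)${skip}([^:\n]*+)/m;

    my @fields;
    while ($map =~ /$re/g) {
        push @fields, $1;
    }

    munmap($map) or die "munmap: $!";
    close $fh;
    return \@fields;
}

# Example call (the file name is an assumption):
# print "$_\n" for @{ mmap_column('data.txt', 1) };
```

Whether this beats a plain while (<$fh>) loop with split depends on your Perl version, platform and data, so time both variants (the Benchmark module makes that easy) before committing to either.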

Makeshifts last the longest.


In reply to Re^5: Vertical split (ala cut -d:) of a file by Aristotle
in thread Vertical split (ala cut -d:) of a file by qhayaal
