in reply to Re^4: Vertical split (ala cut -d:) of a file
in thread Vertical split (ala cut -d:) of a file

So we're talking about 300,000 to 10,000,000 lines. Assuming 50 bytes per line on average, that's 15 to 500 MB. That's going to take a while to process regardless of how you code it; reading 500 MB simply takes time.

Is Perl really the bottleneck here (CPU-bound), or is the code spending most of its time waiting for data from the disks (I/O-bound)? If you are I/O-bound, there isn't much you'll gain by trading split for anything else; any linewise approach will have to wait for data just the same. Reading and processing large chunks at once rather than working linewise could help by making better use of the OS and disk buffers, but even that is unlikely to make a huge difference.
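As a rough illustration of the chunkwise idea (the file contents and variable names here are made up for the example, not taken from the original problem): read big blocks with sysread and carry any trailing partial line over to the next iteration, instead of pulling one line at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small sample file standing in for the real colon-delimited input.
my ($out, $file) = tempfile(UNLINK => 1);
print $out "user$_:x:$_\n" for 1 .. 1000;
close $out;

open my $in, '<', $file or die "open: $!";
my $buf   = '';
my @first;                                    # first field of every line
while (sysread($in, my $chunk, 1 << 16)) {    # 64 KB at a time
    $buf .= $chunk;
    # Keep anything after the last newline for the next iteration,
    # so a line split across two chunks is never processed half-done.
    my $cut = rindex($buf, "\n");
    next if $cut < 0;
    my $complete = substr($buf, 0, $cut + 1);
    $buf = substr($buf, $cut + 1);
    push @first, $complete =~ /^([^:\n]*):/mg;
}
close $in;

print scalar(@first), " lines, first field of line 1: $first[0]\n";
```

Whether this actually beats a plain `while (<$in>)` loop depends on your perl build and your disks; measure before committing to it.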

If you're on a platform that has an mmap(2) syscall you might want to have a look at Sys::Mmap. mmap(2) massively reduces the overhead of I/O. A carefully constructed backtracking-free regex run against a string mmapped to a file may be noticeably faster than any approach doing explicit I/O. Or it may not. The Benchmark module is your friend.
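A sketch of the regex-over-one-big-string idea. With Sys::Mmap the string would come from something like mmap($data, 0, PROT_READ, MAP_SHARED, $fh); here a plain in-memory scalar (with made-up contents) stands in so the example runs without the module installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a file mapped into memory via Sys::Mmap.
my $data = join '', map "host$_:10.0.0.$_:up\n", 1 .. 5;

# Backtracking-free extraction of the second colon-delimited field:
# each character class excludes ':' and newline, and the possessive
# quantifiers (*+, Perl 5.10+) forbid the engine from retrying.
my @second;
while ($data =~ /^[^:\n]*+:([^:\n]*+):/mg) {
    push @second, $1;
}
print "@second\n";    # 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5
```

The same pattern would run unchanged against a genuinely mmapped string; the point is that no per-line I/O or split call happens at all.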

Makeshifts last the longest.
