
32 CPUs, wow! If you're interested, I think splitting the file should be very easy. To split it N ways I would do the following (a rough sketch in code follows the list):

  1. Set $start = 0 and a chunk counter $i = 1.
  2. seek() forward int($size / $N) bytes from $start.
  3. Search forward for the next "^net" delimiter line, capturing its start position in $here.
  4. Write out the chunk from $start to $here to a file "chunk.$i".
  5. Set $start = $here and increment $i.
  6. Loop to 2 until done.
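
Here's a minimal, untested sketch of those steps in Perl. The file name, chunk count, and 64K copy buffer are all placeholders; it's only meant to show the shape of the loop:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift || 'input.dat';   # placeholder input name
    my $n    = shift || 4;             # number of chunks, tune to taste

    open my $in, '<', $file or die "open $file: $!";
    my $size = -s $in;

    my $start = 0;
    CHUNK: for my $i (1 .. $n) {
        my $end = $size;                        # default: run to EOF
        if ($i < $n) {
            # Step 2: jump roughly one chunk's worth of bytes forward.
            seek $in, $start + int($size / $n), 0 or die "seek: $!";
            <$in>;                              # discard the partial line we landed in

            # Step 3: scan for the next "^net" delimiter, noting where it starts.
            while (1) {
                my $pos  = tell $in;
                my $line = <$in>;
                last unless defined $line;      # ran off the end of the file
                if ($line =~ /^net/) { $end = $pos; last }
            }
        }

        # Step 4: copy bytes $start .. $end - 1 into "chunk.$i", a block at a time.
        seek $in, $start, 0 or die "seek: $!";
        open my $out, '>', "chunk.$i" or die "open chunk.$i: $!";
        my $left = $end - $start;
        while ($left > 0) {
            my $got = read($in, my $buf, $left > 65536 ? 65536 : $left);
            die "read: $!" unless defined $got;
            last if $got == 0;
            print $out $buf;
            $left -= $got;
        }
        close $out or die "close chunk.$i: $!";

        # Step 5: the next chunk begins where this one ended.
        $start = $end;
        last CHUNK if $start >= $size;
    }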

Whether this is an overall win (the split itself takes time) depends a lot on your disks. You'll have to tune $N to match the number of CPUs you can keep fed with data; set it too high and you'll go slower as your processes compete for disk access and slow each other down.

-sam


Re^4: Looking for ways to speed up the parsing of a file...
by sgifford (Prior) on May 18, 2008 at 18:59 UTC
    Instead of actually splitting the file into several additional files, you could just determine the positions as you describe, then work on the different parts by seeking to the right position before starting your processing loop. For example, you could determine the start and end position and then fork() off a new process to work on that chunk. Reading from multiple files (or different places in the same file) in parallel might end up being less efficient from an I/O perspective, though, as it could require the drive to seek a lot more. So you'd need to experiment a bit to find the right way to parallelize this.
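
    A minimal sketch of that fork() approach, assuming the (start, end) byte boundaries have already been computed with the seek-and-scan logic above; the file name and hard-coded offsets here are placeholders purely for illustration:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file   = 'input.dat';                               # placeholder name
        my @bounds = ([0, 1_000_000], [1_000_000, 2_000_000]);  # precomputed offsets

        my @kids;
        for my $range (@bounds) {
            my ($from, $to) = @$range;
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: open its own handle so seeks don't collide with siblings.
                open my $in, '<', $file or die "open $file: $!";
                seek $in, $from, 0 or die "seek: $!";
                while (tell($in) < $to) {
                    my $line = <$in>;
                    last unless defined $line;
                    # ... parse one record from $line here ...
                }
                exit 0;
            }
            push @kids, $pid;
        }
        waitpid $_, 0 for @kids;   # parent waits for all the workers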