Re: Huge data file and looping best practices

by samtregar (Abbot)
on Apr 26, 2009 at 16:56 UTC

in reply to Huge data file and looping best practices

Tough problem! You could switch to reading the file as you go, but that's likely to make your program much slower, since you need to access every line so many times.

I'd start by processing the input file into a more efficient representation, something that can be accessed using mmap() from C, for example. If all of your characteristics are boolean (and they are in your example), you can represent them as 15 unsigned 32-bit integers, which gives you room for 480 flags. Add a 32-bit integer for the patient number and each row fits in exactly 512 bits (64 bytes). You can write the pre-processing code in Perl using pack() or Bit::Vector.
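To make the layout concrete, here is a minimal sketch of that 512-bit row, written in C rather than with pack() so it matches the C code suggested further down. The names (patient_row, row_set_flag, row_get_flag, row_from_bits) and the '0'/'1' string input are assumptions for illustration, not anything from your data format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FLAG_WORDS 15                  /* 15 x 32 = 480 boolean characteristics */
#define MAX_FLAGS  (FLAG_WORDS * 32)

/* One fixed-width record: 16 x 32 bits = 512 bits = 64 bytes. */
typedef struct {
    uint32_t patient_id;
    uint32_t flags[FLAG_WORDS];
} patient_row;

/* Set boolean characteristic `bit` (0-based) in a row. */
void row_set_flag(patient_row *row, unsigned bit)
{
    assert(bit < MAX_FLAGS);
    row->flags[bit / 32] |= (uint32_t)1 << (bit % 32);
}

/* Return 1 if characteristic `bit` is set, 0 otherwise. */
int row_get_flag(const patient_row *row, unsigned bit)
{
    assert(bit < MAX_FLAGS);
    return (int)((row->flags[bit / 32] >> (bit % 32)) & 1u);
}

/* Build a row from a string of '0'/'1' characters, one per characteristic.
   (Hypothetical input format; adapt to however your file encodes them.) */
patient_row row_from_bits(uint32_t patient_id, const char *bits)
{
    patient_row row;
    memset(&row, 0, sizeof row);
    row.patient_id = patient_id;
    for (unsigned i = 0; i < MAX_FLAGS && bits[i] != '\0'; i++) {
        if (bits[i] == '1')
            row_set_flag(&row, i);
    }
    return row;
}
```

Writing these structs back to back with fwrite() gives a file where row N starts at byte offset N * 64, which is what makes the mmap() approach cheap. The words are stored in the machine's native byte order here, so the file should be read on the same kind of machine that wrote it.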

Then I'd write some Inline::C code to mmap() the data file and provide access to "rows". The code to compare one row to another should also be written in C. It's basically an XOR of the characteristics followed by a bit-count of the result, so it's not hard to write at all. I'd definitely look at whether a lookup table can speed up the bit-count, perhaps at the 8-bit or 16-bit level. Or you could look at caching comparisons.
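A rough sketch of the C side, assuming a POSIX system and a file of 64-byte rows (one 32-bit patient number plus 15 flag words) in native byte order. It uses an 8-bit lookup table for the bit-count. All function names are made up for the example, and the Inline::C glue that would expose them to Perl is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define FLAG_WORDS 15

typedef struct {
    uint32_t patient_id;
    uint32_t flags[FLAG_WORDS];
} patient_row;

static uint8_t bits_in_byte[256];

/* Fill the table once: bits_in_byte[b] = number of set bits in b. */
void init_bit_table(void)
{
    for (int b = 1; b < 256; b++)
        bits_in_byte[b] = (uint8_t)((b & 1) + bits_in_byte[b >> 1]);
}

/* Bit-count of a 32-bit word using four 8-bit table lookups.
   init_bit_table() must have been called first. */
unsigned popcount32(uint32_t w)
{
    return bits_in_byte[w & 0xff] + bits_in_byte[(w >> 8) & 0xff]
         + bits_in_byte[(w >> 16) & 0xff] + bits_in_byte[w >> 24];
}

/* Number of characteristics on which two rows differ: XOR, then count. */
unsigned row_distance(const patient_row *a, const patient_row *b)
{
    unsigned diff = 0;
    for (int i = 0; i < FLAG_WORDS; i++)
        diff += popcount32(a->flags[i] ^ b->flags[i]);
    return diff;
}

/* Map a file of packed rows read-only. Returns NULL on failure;
   on success stores the number of rows in *n_rows. */
const patient_row *map_rows(const char *path, size_t *n_rows)
{
    assert(path != NULL && n_rows != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    size_t size = (size_t)st.st_size;
    if (size % sizeof(patient_row) != 0) {   /* not a whole number of rows */
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               /* the mapping stays valid */
    if (p == MAP_FAILED)
        return NULL;

    *n_rows = size / sizeof(patient_row);
    return p;
}
```

With the file mapped, comparing patients i and j is just row_distance(&rows[i], &rows[j]). A 16-bit table (65536 entries) halves the lookups per word at the cost of more cache pressure, so it's worth measuring both.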

Finally, I'd use Parallel::ForkManager to make it 8-way parallel. Have each worker process take 1/8 of the patient space and write to its own output file. When they've all finished, cat the output files together and you're done.
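The same split can be sketched in C with plain fork() and wait(), which is roughly what Parallel::ForkManager manages for you from Perl. This is an illustration under assumptions, not a drop-in: slice_bounds, run_workers, the slice_fn callback and the "<prefix>.<worker>" file naming are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Half-open range [begin, end) of rows owned by `worker` out of `n_workers`. */
void slice_bounds(size_t n_rows, unsigned n_workers, unsigned worker,
                  size_t *begin, size_t *end)
{
    assert(n_workers > 0 && worker < n_workers);
    *begin = n_rows * worker / n_workers;
    *end   = n_rows * (worker + 1) / n_workers;
}

/* Hypothetical callback: process rows [begin, end) and write results to
   `out`. Returns 0 on success. */
typedef int (*slice_fn)(size_t begin, size_t end, FILE *out);

/* Fork n_workers children; child w writes to "<prefix>.<w>".
   Returns 0 if every child succeeded, -1 otherwise. */
int run_workers(size_t n_rows, unsigned n_workers,
                const char *prefix, slice_fn fn)
{
    unsigned started = 0;
    int failed = 0;

    for (unsigned w = 0; w < n_workers; w++) {
        pid_t pid = fork();
        if (pid < 0) {                 /* could not start this worker */
            failed = 1;
            break;
        }
        if (pid == 0) {                /* child: do one slice, then exit */
            char path[4096];
            size_t begin, end;
            snprintf(path, sizeof path, "%s.%u", prefix, w);
            slice_bounds(n_rows, n_workers, w, &begin, &end);
            FILE *out = fopen(path, "w");
            int rc = out ? fn(begin, end, out) : 1;
            if (out && fclose(out) != 0)
                rc = 1;
            _exit(rc ? 1 : 0);         /* skip the parent's atexit/stdio state */
        }
        started++;
    }

    /* Parent: reap every child that was started. */
    for (unsigned i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    return failed ? -1 : 0;
}
```

If the data file is mmap()ed read-only before forking, the children share the same pages, so 8 workers don't mean 8 copies of the data in memory. Afterwards something like cat out.0 out.1 ... out.7 > all.out stitches the results together.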

I'd be shocked if this didn't run 100x faster than the Perl code you've got now.