in reply to When the input file is huge !!!

You have 2 files. If the first is large (a few hundred MB), then I'd expect perl to run out of memory before you've finished loading it. If you're relying on virtual memory this might take a long time. If it does succeed in loading but is swapping, then processing the second file will take forever. Either way, you aren't going to handle 8 GB of data in RAM at once.

I would sort both files first. Unix provides a sort utility for exactly this; if you don't have it, Sort::External is a pure Perl alternative (I have not used it myself). Then process both files in parallel, the idea being that the sequences come up in the same order in both files. So you have 2 filehandles (one per file) and the last line read from each, and you always advance whichever file's current line sorts lower, processing a match when you find one. That way you keep almost nothing in RAM.
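
Here is a minimal sketch of that parallel scan, assuming both files are already sorted on their first whitespace-delimited field (the file names and the key rule are placeholders, not necessarily the OP's actual format):

    use strict;
    use warnings;

    open my $fh_a, '<', 'file_a.sorted' or die "file_a.sorted: $!";
    open my $fh_b, '<', 'file_b.sorted' or die "file_b.sorted: $!";

    # the key is assumed to be the first whitespace-delimited field
    sub key_of { my ($line) = @_; return (split ' ', $line)[0] }

    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;

    while (defined $line_a and defined $line_b) {
        my $cmp = key_of($line_a) cmp key_of($line_b);
        if    ($cmp < 0) { $line_a = <$fh_a> }      # A is behind, advance it
        elsif ($cmp > 0) { $line_b = <$fh_b> }      # B is behind, advance it
        else {
            print "MATCH: $line_a";                 # equal keys: handle the match
            $line_a = <$fh_a>;
            $line_b = <$fh_b>;
        }
    }
    # whatever remains in either file simply has no partner

If a key can repeat within one file, the match-handling step needs a little more care, but the scanning pattern stays the same.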

Be warned that sorting 8 GB is liable to take 20 minutes or so on your machine.

Re^2: When the input file is huge !!!
by Marshall (Canon) on Jan 06, 2009 at 05:28 UTC
    This sounds like a pretty good idea! A simple sort-merge algorithm takes each "hunk" of data that it can handle in memory, sorts it, and writes it to a new place on disk. This requires one read of the entire data set and one write. 8 GB is, in the scheme of things, not that "big".
    Let's say that each "hunk" is just 500 MB, which my Windows machine can sort easily; with 8 GB we wind up with 16 hunks. The merge then opens, say, all 16 files at once, and the next part is easy: repeatedly move the smallest of the 16 top records to the output (see the sketch below). So for a "small" data set like 8 GB: 1) read once, 2) write once, 3) read again, 4) write again.
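
    As a rough illustration, a 16-way merge of the sorted hunks can look like this in Perl (just a sketch: the run-file names and whole-line string comparison are assumptions, and with many more runs you would use a heap instead of a linear scan):

        use strict;
        use warnings;

        my @run_files = glob 'run_*.sorted';           # the pre-sorted hunks (hypothetical names)
        my @fhs  = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @run_files;
        my @tops = map { scalar readline $_ } @fhs;    # current top record of each hunk

        open my $out, '>', 'merged.txt' or die "merged.txt: $!";
        while (1) {
            my $min;                                   # index of the hunk whose top record sorts first
            for my $i (0 .. $#tops) {
                next unless defined $tops[$i];
                $min = $i if !defined $min or $tops[$i] lt $tops[$min];
            }
            last unless defined $min;                  # every hunk is exhausted
            print {$out} $tops[$min];
            $tops[$min] = readline $fhs[$min];         # refill from the hunk we just drained
        }
        close $out;
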
    I suspect the system sort utilities are faster than this. Very smart ones will shovel data between several disks to speed up access, and their algorithms are smarter than the one described above. In any case, "sort" on a big machine is heavily optimized; the Unix command-line version will probably do much better than you think.
      FYI, a simple merge-sort takes log(n)/log(2) passes through the data. You are describing a somewhat more intelligent merge that takes fewer passes, and you eliminate most of the passes up front by sorting large hunks in memory at once.
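
      For concreteness: with 16 sorted hunks, a plain two-way merge needs log(16)/log(2) = 4 merge passes over the full data set, whereas the 16-way merge described above finishes in a single pass.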

      The merge is likely somewhat slower than you describe. The limiting factor is how many parallel read/write streams the drive can handle while still streaming rather than seeking. My memory says that the limit is 4, which means you can read at most 3 chunks while writing 1; if you try to do more, the disk starts to thrash. If that is right, then it can handle 16 hunks in 3 passes through the dataset (16 hunks merge to 6, then 2, then 1), where each pass has to both read and write all of the data. That is in addition to the original pass that sorts the data into chunks. So the dataset has to be streamed 8 times (4 times from disk, 4 times to it). If streaming takes 1 minute per pass (which is what the OP said about his dataset), that would be 8 minutes. That is substantially faster than what I said, but I like to be conservative when I tell people that things will be slow, because I don't want them to give up prematurely if it is slower than they expect.

      Also note that a pure Perl implementation will probably have trouble sorting 500 MB chunks of data in RAM because of the memory overhead of Perl's internal data structures. However, if you are reading from and writing to the filesystem, you can wind up creating, writing, reading and deleting a file with all operations happening in RAM, without anything ever being flushed to disk. Used intelligently, this can greatly increase performance, because most passes never touch the disk. This can make a fairly naive merge-sort perform reasonably close to a much more heavily optimized one.
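
      To make that concrete, here is a sketch of the run-creation phase, sorting modest chunks in memory and writing each sorted run to a temporary file (the chunk size, input file name, and whole-line sort key are assumptions, not the OP's format):

          use strict;
          use warnings;
          use File::Temp qw(tempfile);

          my $chunk_lines = 1_000_000;   # keep each chunk comfortably inside real RAM
          my @run_files;

          open my $in, '<', 'huge_input.txt' or die "huge_input.txt: $!";
          while (1) {
              my @chunk;
              while (defined(my $line = <$in>)) {
                  push @chunk, $line;
                  last if @chunk >= $chunk_lines;
              }
              last unless @chunk;
              @chunk = sort @chunk;                     # sort this run in memory
              my ($fh, $name) = tempfile(UNLINK => 0);  # small, short-lived files often stay in the OS cache
              print {$fh} @chunk;
              close $fh or die "close: $!";
              push @run_files, $name;
          }
          close $in;
          # @run_files now holds the sorted runs, ready for the merge phase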