in reply to Working on huge (GB sized) files

More of a guess given your lack of details, but:

  1. Make 1 pass over the first file, building a hash keyed by the common field, with the values being the file position of the start of the record containing that key.
  2. Make 1 pass through the second file, adding the file position(s) of the records containing the common keys to the hash.
  3. Process the hash, reading in the records for each key from both files, join them as appropriate, and write them to the new file.

You'll never have more than one record from each file in memory at any given time, so beyond the size of the hash required to hold the keys, the size of the files doesn't matter.
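
A minimal Perl sketch of those three steps (a sketch only: the file names, the tab-separated layout with the common key in the first field, and the joined output format are all assumptions, since the OP gave no details):

    use strict;
    use warnings;

    my ( $file_a, $file_b, $out_file ) = ( 'a.dat', 'b.dat', 'joined.dat' );
    my %pos;    # key => [ offset in file A, offset in file B ]

    # Pass 1: index the first file by the common key.
    open my $A, '<', $file_a or die "$file_a: $!";
    while( 1 ) {
        my $offset = tell $A;
        defined( my $line = <$A> ) or last;
        my( $key ) = split /\t/, $line, 2;
        $pos{ $key }[ 0 ] = $offset;
    }

    # Pass 2: add the offsets of the matching records in the second file.
    open my $B, '<', $file_b or die "$file_b: $!";
    while( 1 ) {
        my $offset = tell $B;
        defined( my $line = <$B> ) or last;
        my( $key ) = split /\t/, $line, 2;
        $pos{ $key }[ 1 ] = $offset if exists $pos{ $key };
    }

    # Pass 3: seek to each pair of records, join them, write them out.
    open my $OUT, '>', $out_file or die "$out_file: $!";
    for my $key ( keys %pos ) {
        my( $off_a, $off_b ) = @{ $pos{ $key } };
        next unless defined $off_a and defined $off_b;
        seek $A, $off_a, 0;    chomp( my $rec_a = <$A> );
        seek $B, $off_b, 0;    chomp( my $rec_b = <$B> );
        print { $OUT } "$rec_a\t$rec_b\n";    # "join as appropriate" is up to the OP
    }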

An O(3N) process should be substantially quicker than 2 x O(NlogN) sorts + an O(N) merge if coded properly.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Working on huge (GB sized) files
by sundialsvc4 (Abbot) on May 12, 2011 at 16:47 UTC

    ... yes, and if the number of records is huge, such that the memory consumption of in-memory data structures is problematic (highly unlikely, these days...), you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation.   (What worked in the days of COBOL still works today.)
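
    By way of illustration, a minimal sketch of that single sequential merge pass, assuming each file has already been sorted on the common key (e.g. with the system sort(1)), that records are single lines with the key in the first tab-separated field, and that keys are unique within each file (all assumptions, not from the OP):

        use strict;
        use warnings;

        open my $A, '<', 'a.sorted' or die $!;
        open my $B, '<', 'b.sorted' or die $!;

        my $rec_a = <$A>;
        my $rec_b = <$B>;

        # Advance whichever stream holds the smaller key; emit a joined
        # line whenever the two keys match.
        while( defined $rec_a and defined $rec_b ) {
            chomp $rec_a;  chomp $rec_b;
            my( $key_a ) = split /\t/, $rec_a, 2;
            my( $key_b ) = split /\t/, $rec_b, 2;
            if   ( $key_a lt $key_b ) { $rec_a = <$A> }
            elsif( $key_a gt $key_b ) { $rec_b = <$B> }
            else {
                print "$rec_a\t$rec_b\n";
                $rec_a = <$A>;
                $rec_b = <$B>;
            }
        }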

    If the “huge files” happen to be XML files, modules such as XML::Twig are designed to work with files even of that size, without overwhelming memory.
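
    For the XML case, a rough sketch of that streaming style with XML::Twig (the element name "record" and its "id" child are assumptions about the OP's data):

        use strict;
        use warnings;
        use XML::Twig;

        my $twig = XML::Twig->new(
            twig_handlers => {
                # Called once per <record>; the whole document is never
                # held in memory at once.
                record => sub {
                    my( $t, $rec ) = @_;
                    my $key = $rec->first_child_text( 'id' );
                    # ... index or process the record here ...
                    $t->purge;    # release everything parsed so far
                },
            },
        );
        $twig->parsefile( 'huge.xml' );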

      you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)

      "Works" yes. Efficient? No.

      That is exactly the "2 x O(NlogN) sorts + an O(N) merge" process I was decrying.

      Let's say that each file consists of 100 million records.

      1. 2 * ( 1e8 * log10( 1e8 ) ) + 1e8 = 1.7 billion operations.
      2. 3 * 1e8                          = 0.3 billion operations.

      So, theoretically 6 times longer. And in reality, usually 15 or 20 times longer, because the two sorts themselves involve multiple passes and a merge phase not taken into consideration by the above simplistic equations.

      And given that many of those operations are I/O, and therefore still measured in milliseconds (little different from how they were in the '70s), whereas CPU ops are now measured in nanoseconds (rather than the 10s or 100s of microseconds they took back in the '70s), it is easy to see that it is worth expending a lot of effort (human ingenuity and CPU cycles) to avoid moving to the disk-based approach.

      If you perform the merge both ways--hash&memory .v. disksort&merge--of two files large enough that the former just fits within your given memory limit, then typically the latter will take close to 20 times longer. Then when faced with the situation where one more record pushes you beyond the memory limit, it is worth expending considerable effort to try and fit that extra record in memory and so avoid the 20x penalty.

      For example, Perl's hashes are designed for very general-purpose usage, and as such are quite memory hungry. It is quite easy to come up with a hash-like structure that supports just the subset of operations required for this particular purpose whilst using substantially less memory. And even if such a structure is implemented in pure Perl, and is therefore considerably slower than Perl's built-in hashes, if it allows you to fit that extra record in memory (and fitting 10 times more is achievable), then you will still have saved time by avoiding the 20-times slow down.
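
      As a rough illustration of the memory difference (not a recommendation of any particular structure): packing the pair of file offsets for each key into a single scalar, rather than storing them as a two-element anonymous array, already removes a lot of per-array overhead. Devel::Size, the 'q' 64-bit pack format and the key format below are assumptions used only to eyeball that difference:

          use strict;
          use warnings;
          use Devel::Size qw( total_size );

          my( %fat, %lean );
          for my $i ( 1 .. 100_000 ) {
              my $key = sprintf 'K%08d', $i;
              $fat{ $key }  = [ $i, $i + 1 ];           # two offsets as an array ref
              $lean{ $key } = pack 'qq', $i, $i + 1;    # the same two offsets, packed
          }
          printf "array refs: %d bytes\n", total_size( \%fat );
          printf "packed:     %d bytes\n", total_size( \%lean );

      Going further, replacing the hash itself with a packed, sorted index that is binary-searched buys considerably more, at the cost of the pure-Perl lookup speed mentioned above.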

      Those old COBOL techniques worked and still do. But they were used back then because there was no feasible alternative, not because they were a good solution. At a pinch, when all else fails, they are a fall-back position. A last resort, not a first.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Working on huge (GB sized) files
by Marshall (Canon) on May 13, 2011 at 15:37 UTC
    A limitation of the O notation is that it doesn't take into account the cost of each operation.

    Reading all of the data sequentially off of the disk for steps 1,2 is relatively fast. There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records. The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.
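
    That gap is easy to see with a crude timing sketch (Time::HiRes, the 4KB block size and the test file name are assumptions; OS caching will skew the absolute numbers considerably):

        use strict;
        use warnings;
        use Time::HiRes qw( time );

        my $file  = 'big.dat';          # any existing multi-GB file
        my $reads = 10_000;
        my $size  = -s $file;

        open my $fh, '<:raw', $file or die $!;
        my $buf;

        # Sequential: the same number of 4KB blocks, read front to back.
        my $t0 = time;
        read $fh, $buf, 4096 for 1 .. $reads;
        printf "sequential: %.3f s\n", time - $t0;

        # Random: seek to a random offset before every read.
        $t0 = time;
        for ( 1 .. $reads ) {
            seek $fh, int rand( $size - 4096 ), 0;
            read $fh, $buf, 4096;
        }
        printf "random:     %.3f s\n", time - $t0;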

    Which way actually works out to be fastest "wall clock wise" in practice depends upon a lot of variables, including a lot of OS and file system configuration stuff (like details of the disk caching, etc.), as well as the details of exactly what kind of data format we are dealing with - the size of the records, how many there are, the length of the file. Right now all we know is that the files are GB-sized and will not fit into memory.

    So like many things with programming, there are a lot of "yeah but's". The OP says he is having trouble solving the problem by any method. It could be that the run time doesn't matter and that what matters is the implementation effort. In that case, I would recommend the easiest to implement - whatever that is for the OP.

    The observation that it is always going to be faster if all of the data can fit into memory is absolutely true. This is worth a considerable amount of coding effort if the task is to be performed often. I have no idea of how the OP is going to weigh the different trade-offs. Perhaps he will report back in after he has tried some of these ideas.

      First, I mostly agree with everything you've said, hence the first line of my OP above: "More of a guess given your lack of details,".

      But there are a couple of points worth picking up on:

      1. There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records.

        Agreed. But this is often more than balanced out by the merge steps of disk-based sorts.

        Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it and writing it to a temp file; read another chunk, sort and write to temp; and so on (a rough sketch of this follows the list).

        Whilst the reading and writing of the chunks are sequential, they are interleaved, and that alone has a substantial performance impact over reading the whole sequentially and then writing the whole sequentially.

        But of far bigger significance is the merge phase of the temporary files. Whilst the reading of records from a given temp file is sequential, collectively they are random. This has at least as much impact (disk seeks) as reading a single file randomly. But it also has a secondary impact, that of forcing the file-cache to juggle multiple active files, and the impacts of that are far harder to predict and can be substantial.

        The problem is the nature of the patterns of usage in the merge phase. Sometimes many records will be read from one temporary file, then many from another, and then another and then (say) back to the first. At other points, the access pattern will skip from file to file for every record.

        This randomly changing pattern of usage confuses most intelligent caching algorithms by continually switching from mostly sequential to mostly random and all points in between. That confusion can actively work against performance by (say) flushing newly-read, part-accessed blocks from the cache prematurely, because previous access patterns suggested that the access is random and so the rest of the block will not be used soon; or conversely, by retaining part-accessed blocks far longer than is good for the rest of the system, because access patterns suggest sequential access and so they will be needed soon.

        Most (if not all) caching heuristics only consider a single file at a time, even when multiple files are being accessed by the same process, so in effect, the multiple open files of the merge phase of an external sort act as if competing processes were at work.

        So the sequential reading of records randomly distributed amongst a number of temporary files is at least as costly as, and usually much more costly than, the random access of records from just two files. And remember that in the sort&merge approach, there are three merge phases, not one as with the hash&memory method.

      2. The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.

        Yes, but ... :)

        With the sort&merge, every record of both files has to be read from and written back to disk twice during the two sorts: once when sorting the chunks; once when merging them. And then each record has to be read and written a third time when merging the two sorted files.

        Assuming (for simplicity) equal sized files:

        • For the sort&merge:

          that is 6N reads and 6N writes; with 2N reads being effectively random.

        • For the hash&memory:

          There are 4N reads and 2N writes; of which 2N reads are random.

        So, even discounting the { 2 * P * O( N/P log(N/P) ) } sorting .v. { O( 2N ) } hashing differences, that's { 4N-SR + 2N-RR + 6N-SW } versus { 2N-SR + 2N-RR + 2N-SW } (SR = sequential reads, RR = random reads, SW = sequential writes). That's a lot of wall time to be made up.
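
      To make the chunk-and-merge pattern described in point 1 concrete, here is a rough sketch of a disk-based sort of one file (the chunk size, file names and whole-line lexicographic ordering are assumptions; a real external sort would sort on the key field and clean up its temp files):

          use strict;
          use warnings;

          my $chunk_size = 1_000_000;           # records per in-memory chunk
          my( @temps, @buffer );

          # Phase 1: read a chunk, sort it in memory, write it to a temp file.
          open my $in, '<', 'a.dat' or die $!;
          while( my $line = <$in> ) {
              push @buffer, $line;
              next if @buffer < $chunk_size and not eof $in;
              my $temp = 'chunk' . @temps . '.tmp';
              open my $out, '>', $temp or die $!;
              print { $out } sort @buffer;
              close $out;
              push @temps, $temp;
              @buffer = ();
          }

          # Phase 2: merge the temp files; the reads skip from file to file.
          my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @temps;
          my @heads = map { scalar readline $_ } @fhs;
          open my $sorted, '>', 'a.sorted' or die $!;
          while( grep { defined } @heads ) {
              my( $min ) = sort { $heads[ $a ] cmp $heads[ $b ] }
                           grep { defined $heads[ $_ ] } 0 .. $#heads;
              print { $sorted } $heads[ $min ];
              $heads[ $min ] = readline $fhs[ $min ];
          }

      It is phase 2 - one read stream per temp file, all competing for the same disk and cache - that produces the access pattern described above, and it has to be paid for both input files before the final merge of the two sorted results even starts.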


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        The problem is the nature of the patterns of usage in the merge phase. Sometimes many records will be read from one temporary file, then many from another, and then another and then (say) back to the first. At other points, the access pattern will skip from file to file for every record.

        This ugly pattern can be easily eliminated at the application layer by reading the records in blocks instead of one by one.
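
        For instance (a sketch only; the block size is an arbitrary assumption), each input of the merge can refill a per-file buffer a few thousand records at a time, so the disk sees occasional large sequential reads rather than one tiny read per record:

            use strict;
            use warnings;

            my $block = 10_000;    # records fetched per refill

            # Return the next record from this input, refilling its
            # buffer in one sequential burst whenever it runs dry.
            sub next_record {
                my( $fh, $buffer ) = @_;
                if( !@$buffer ) {
                    while( defined( my $line = <$fh> ) ) {
                        push @$buffer, $line;
                        last if @$buffer >= $block;
                    }
                }
                return shift @$buffer;    # undef once the file is exhausted
            }

            # In the merge loop, e.g.:  my $rec = next_record( $fh, \@buffer_for_that_file );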

        Good discussion. I agree there is certainly some guess work involved in this.

        Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it and writing it to a temp file; read another chunk, sort and write to temp.
        Yes, agree.
        Whilst the reading and writing of the chunks are sequential, they are interleaved, and that alone has a substantial performance impact over reading the whole sequentially and then writing the whole sequentially.
        I am not so sure about this. I guess that what you mean by interleaved is that the program switches back and forth from read to write. Typically the huge input file would be read sequentially, but by reading big hunks (maybe thousands of records at a time). Sort. Then write the big hunk to a new file. So the data gets read once and written once.

        In the create hash table scenario, the data only has to be read once - there is no write. Presumably the data is also read in big hunks, say same size hunk as above.

        The system will certainly allow other I/O processing right after the "big block" request has been satisfied. The system is probably also interleaving other disk I/O requests while the "big block" operation is proceeding. That's true in either scenario above. So when the 2nd read needs to happen, in either case the disk head is most likely somewhere else and not sitting right at the next block to be read. I see a 2x difference, which is "substantial". I don't know if you meant that it was a lot more than that or not.

        There are of course all kinds of things that can be happening cache-wise. It is not clear how much, if any, of the write data will be kept once it is flushed to disk, as the file presumably is opened write-only, which is a big clue. In both scenarios there could also be read-ahead happening.

        I think the big difference is that in the hash table scenario, each disk operation is going to be getting a random record, maybe 2 different records from different parts of the two files. Whereas the merge is going to be getting thousands of sequential records, and does this only when one of its n queues runs dry. So it looks to me like the hash approach has the potential for generating many more disk operations than the sort/merge - enough to swamp this initial 2x I/O difference.

        How all of this is going to work out is very dependent upon what system it is running on, how the caching is set up and what other processes are doing at the same time, and how much of the total system resources this collation process is able to use... I think this is going to come down to a YMMV situation. And although the sort is an N log N operation versus an N operation for scanning and building the hash, the whole job is going to be dominated by disk I/O time. The CPU operations won't matter even if there are a heck of a lot more of them, because they are so fast compared with the disk. I suspect the hash approach uses less actual CPU time, but takes more "wall clock" time.