Reading all of the data sequentially off of the disk for steps 1 and 2 is relatively fast. The trouble comes with step 3, because it involves seeking around within the files to gather up the records, and reading randomly selected records usually takes a lot longer than reading sequential records. The "wall clock" time of one seek plus the rotational delay is measured in milliseconds, while a single sort comparison takes nanoseconds. Thousands of comparisons can be performed in the time it takes to do one seek.
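To put rough numbers on that (my ballpark figures, not measurements from the OP's setup): a seek plus rotational delay on a spinning disk is on the order of 5-10 ms, while one string comparison is on the order of 100 ns, so a single seek costs tens of thousands of comparisons. Here is a sketch of what the seek-heavy step 3 looks like in Perl - the file name and offsets are made up, and in practice the offsets would have been recorded with tell() during the earlier sequential pass:

use strict;
use warnings;

# Illustration only: gather records by seeking to known byte positions.
my $file    = 'records.dat';
my @offsets = (52_428_800, 1_048_576, 7_340_032);

open my $fh, '<', $file or die "Can't open $file: $!";
for my $pos (sort { $a <=> $b } @offsets) {    # visiting offsets in order keeps the disk head moving one way
    seek $fh, $pos, 0 or die "seek to $pos failed: $!";   # 0 means SEEK_SET (absolute position)
    my $record = <$fh>;    # each of these reads pays the random-access penalty
    print $record;
}
close $fh;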
Which way actually works out to be fastest "wall clock"-wise in practice depends upon a lot of variables, including a lot of OS and file system configuration stuff (details of the disk caching, etc.), as well as the details of exactly what kind of data we are dealing with: the record size, how many records there are, and the length of the file. Right now all we know is that the files are <1GB and will not fit into memory.
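Because so many of those variables are unknown here, the only way to really settle it is to time the candidates on the real data, on the real machine. A minimal harness - the sub names are placeholders for whatever two approaches end up being compared:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder subs - put the actual seek-based and sequential-pass code here.
sub gather_by_seeking     { }
sub gather_by_second_pass { }

# Run each sub for at least 3 CPU seconds and print a comparison table.
cmpthese( -3, {
    seeking    => \&gather_by_seeking,
    sequential => \&gather_by_second_pass,
});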
So, like many things in programming, there are a lot of "yeah, but"s. The OP says he is having trouble solving the problem by any method. It could be that run time doesn't matter and that what matters is implementation effort. In that case, I would recommend whichever approach is easiest to implement - whatever that is for the OP.
The observation that it is always going to be faster if all of the data can fit into memory is absolutely true. This is worth a considerable amount of coding effort if the task is to be performed often. I have no idea how the OP is going to weigh the different trade-offs. Perhaps he will report back after he has tried some of these ideas.
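For contrast, the all-in-memory case is almost trivial to write - the real coding effort goes into making the data compact enough for something like this to be possible at all. A sketch with made-up file names (and keep in mind that Perl strings carry a few dozen bytes of overhead each, so an array of short records can need several times the file's on-disk size):

use strict;
use warnings;

open my $in, '<', 'records.dat' or die "Can't open records.dat: $!";
my @records = <$in>;        # slurp every record (one line each) into memory
close $in;

@records = sort @records;   # a single in-memory sort, no seeking at all

open my $out, '>', 'records.sorted' or die "Can't write records.sorted: $!";
print {$out} @records;
close $out;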