Re: Working on huge (GB sized) files
by BrowserUk (Patriarch) on May 12, 2011 at 16:17 UTC
More of a guess given your lack of details, but:
- Make 1 pass of the first file building a hash, keyed by the common field with the values being the file position of the start of the record containing that key.
- Make 1 pass through the second file, adding the file position(s) of the records containing the common keys to the hash.
- Process the hash, reading in the records for each key from both files, join them as appropriate, and write them to the new file.
You'll never have more than one record from each file in memory at any given time, so beyond the size of the hash required to hold the keys, the size of the files doesn't matter.
An O(3N) process should be substantially quicker than 2 x O(NlogN) sorts + an O(N) merge if coded properly.
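A minimal sketch of the three passes above, assuming plain line-oriented records with the common key as the first tab-separated field (the file names and record format are illustrative guesses, not taken from the OP):

    use strict;
    use warnings;

    my ( $fileA, $fileB, $outfile ) = ( 'a.dat', 'b.dat', 'joined.dat' );
    my %pos;    # key => [ offset in file A, [ offsets in file B ] ]

    # Pass 1: index the first file: key => byte offset of its record.
    open my $A, '<', $fileA or die "open $fileA: $!";
    while( 1 ) {
        my $offset = tell $A;
        defined( my $line = <$A> ) or last;
        my( $key ) = split /\t/, $line, 2;
        $pos{ $key }[ 0 ] = $offset;
    }

    # Pass 2: note the offsets of matching records in the second file.
    open my $B, '<', $fileB or die "open $fileB: $!";
    while( 1 ) {
        my $offset = tell $B;
        defined( my $line = <$B> ) or last;
        my( $key ) = split /\t/, $line, 2;
        push @{ $pos{ $key }[ 1 ] }, $offset if exists $pos{ $key };
    }

    # Pass 3: seek to each pair of records, join them and write them out.
    open my $OUT, '>', $outfile or die "open $outfile: $!";
    for my $key ( keys %pos ) {
        my( $offA, $offsB ) = @{ $pos{ $key } };
        next unless $offsB;                 # key must appear in both files
        seek $A, $offA, 0;
        chomp( my $recA = <$A> );
        for my $offB ( @$offsB ) {
            seek $B, $offB, 0;
            chomp( my $recB = <$B> );
            print $OUT "$recA\t$recB\n";
        }
    }

Only one record from each file is ever held in memory; the hash holds just the keys plus a few integers per key.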
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
... yes, and if the number of records is huge, such that the memory consumption of in-memory data structures is problematic (highly unlikely, these days...), you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)
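For reference, the single sequential pass over the two sorted streams might look roughly like this (a sketch only, assuming one record per line, the key as the first tab-separated field, unique keys in the first file, and made-up file names):

    use strict;
    use warnings;

    open my $A, '<', 'a.sorted' or die $!;
    open my $B, '<', 'b.sorted' or die $!;

    my $lineA = <$A>;
    my $lineB = <$B>;
    while( defined $lineA and defined $lineB ) {
        my( $keyA ) = split /\t/, $lineA, 2;
        my( $keyB ) = split /\t/, $lineB, 2;
        if    ( $keyA lt $keyB ) { $lineA = <$A>; }
        elsif ( $keyA gt $keyB ) { $lineB = <$B>; }
        else {
            chomp( my $recA = $lineA );
            chomp( my $recB = $lineB );
            print "$recA\t$recB\n";     # joined record to STDOUT
            $lineB = <$B>;              # A advances once B moves past this key
        }
    }

Handling duplicate keys on both sides would need a small buffer of the current key's records, but the principle is the same.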
If the “huge files” happen to be XML files, modules such as XML::Twig are designed to work with files even of that size, without overwhelming memory.
you can also throw the keys into separate files, sort the two files identically using an external disk sort, and process the two streams in a single sequential pass ... a classic “merge” operation. (What worked in the days of COBOL still works today.)
"Works" yes. Efficient? No.
That is exactly the "2 x O(NlogN) sorts + an O(N) merge" process I was decrying.
Let's say that each file consists of 100 million records.
- 2 * ( 1e8 * log10( 1e8 ) ) + 1e8 = 1.7 billion operations.
- 3 * 1e8 = 0.3 billion operations.
So, theoretically 6 times longer. And in reality, usually 15 or 20 times longer, because the two sorts themselves involve multiple passes and a merge phase not taken into consideration by the above simplistic equations.
And given that many of those operations are I/O, still measured in milliseconds (little different from how they were in the '70s), whereas CPU operations are now measured in nanoseconds (rather than the 10s or 100s of microseconds they took back in the '70s), it is easy to see that it is worth expending a lot of effort (human ingenuity and CPU cycles) to avoid moving to the disk-based approach.
If you perform the merge both ways, hash&memory vs. disksort&merge, on two files just large enough that the former still fits within your given memory limit, then typically the latter will take close to 20 times longer. So when faced with the situation where one more record pushes you beyond the memory limit, it is worth expending considerable effort to try and fit that extra record in memory and so avoid the 20x penalty.
For example, Perl's hashes are designed for very general-purpose usage, and as such are quite memory hungry. It is quite easy to come up with a hash-like structure that supports the subset of operations required for this particular purpose whilst using substantially less memory in the process. And even if this is implemented in pure Perl, and is therefore considerably slower than Perl's built-in hashes, if it allows you to fit that extra record in memory (and fitting 10 times as many is possible), then you will still have saved time by avoiding the 20-times slowdown.
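To illustrate the kind of thing meant (an illustration only, not a benchmark): rather than a hash of array references, the offsets collected for each key can be appended to one packed string, so every key costs a single scalar instead of an array plus one scalar per offset:

    use strict;
    use warnings;

    my %idx;    # key => packed string of 64-bit file offsets

    sub add_offset {
        my( $key, $offset ) = @_;
        $idx{ $key } .= pack 'Q', $offset;  # 8 bytes per offset; 'Q' needs a 64-bit perl
    }

    sub offsets_for {
        my( $key ) = @_;
        return unless exists $idx{ $key };
        return unpack 'Q*', $idx{ $key };
    }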
Those old COBOL techniques worked and still do. But they were used back then because there was no feasible alternative, not because they were a good solution. At a pinch, when all else fails, they are a fall-back position. A last resort, not a first.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
A limitation of the O notation is that it doesn't take into account the cost of each operation.
Reading all of the data sequentially off the disk for steps 1 and 2 is relatively fast. There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records. The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.
Which way actually works out to be fastest "wall clock wise" in practice depends upon a lot of variables, including a lot of OS and file system configuration stuff (like the details of the disk caching, etc.), as well as the details of exactly what kind of data format we are dealing with: the size of the records, how many there are, the length of the file. Right now all we know is that the files are GB-sized and will not fit into memory.
So like many things with programming, there are a lot of "yeah but's". The OP says he is having trouble solving the problem by any method. It could be that the run time doesn't matter and that what matters is the implementation effort. In that case, I would recommend the easiest to implement - whatever that is for the OP.
The observation that it is always going to be faster if all of the data can fit into memory is absolutely true. This is worth a considerable amount of coding effort if the task is to be performed often. I have no idea of how the OP is going to weigh the different trade-offs. Perhaps he will report back in after he has tried some of these ideas.
First, I mostly agree with everything you've said, hence the first line of my OP above: "More of a guess given your lack of details,".
But there are a couple of points worth picking up on:
There can be trouble with step 3, because it involves seeking around within the files to gather up the records - reading randomly selected records usually takes a lot longer than reading sequential records.
Agreed. But this is often more than balanced out by the merge steps of disk-based sorts.
Disk-based sorts get around the memory problem by reading a chunk of the source, sorting it and writing it to a temp file; then reading another chunk, sorting it and writing it to another temp file; and so on until the input is exhausted.
Whilst the reading and writing of each chunk is sequential, the reads and writes are interleaved, and that alone has a substantial performance impact compared with reading the whole file sequentially and then writing the whole file sequentially.
But of far bigger significance is the merge phase of the temporary files. Whilst the reading of records from a given temp file is sequential, collectively they are random. This has at least as much impact (disk seeks) as reading a single file randomly. But it also has a secondary impact, that of forcing the file-cache to juggle multiple active files, and the impacts of that are far harder to predict and can be substantial.
The problem is the nature of the patterns of usage in the merge phase. Sometimes many records will be read from one temporary file, then many from another, and then another and then (say) back to the first. At other points, the access pattern will skip from file to file for every record.
This randomly changing pattern of usage confuses most intelligent caching algorithms by continually switching from mostly sequential to mostly random and all points in between, and the confusion can actively work against performance: say, by flushing newly-read, part-accessed blocks from the cache prematurely, because previous access patterns suggested the access is random and so the rest of the block will not be used soon; or, conversely, by retaining part-accessed blocks far longer than is good for the rest of the system, because access patterns suggest sequential access and so they will be needed soon.
Most (if not all) caching heuristics only consider a single file at a time, even when multiple files are being accessed by the same process, so in effect, the multiple open files of the merge phase of an external sort act as if competing processes were at work.
So the sequential reading of records randomly distributed amongst a number of temporary files is at least as costly, and usually much more so, than the random access of records from just two files. And remember that in the sort&merge approach, there are three merge phases, not one as with the hash&memory method.
The "wall clock" time of a seek operation and rotational delays is in msec. One of the sort comparisons is in nsecs. Thousands of comparisons can be performed in the time it takes to do one seek.
Yes, but ... :)
With the sort&merge, every record of both files has to be read from and written back to disk twice during the two sorts: once when sorting the chunks, and once when merging them. Then each record has to be read and written a third time when merging the two sorted files.
Assuming (for simplicity) two equal-sized files of N records each, split into P chunks for the external sort:
So, even discounting the { 2 * P * O( (N/P) log(N/P) ) } sorting vs. { O( 2N ) } hashing differences, that's { 6N-SR + 2N-RR + 8N-SW } versus { 2N-SR + 2N-RR + 2N-SW } (SR = sequential reads, RR = random reads, SW = sequential writes). That's a lot of wall time to be made up.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Working on huge (GB sized) files
by LanX (Saint) on May 12, 2011 at 15:20 UTC
Re: Working on huge (GB sized) files
by Marshall (Canon) on May 12, 2011 at 15:39 UTC
Your post is pretty light on specifics. Perhaps this will work: process each file, emitting an intermediate file in which each record is a single line with the "common field" replicated at the beginning. Do that to both files.
Then concatenate the intermediate files into one file with cat, and use the system sort on the command line for that file. Now all records that have the same "common field" are adjacent. Process that file to do what you want.
If the input files are in CSV, with the right options to the sort command, you can sort on an arbitrary field.
The system's sort command doesn't have to have all the data in memory at once and it will make temp files and do whatever it needs to do in order to sort this huge file. This can be faster than you might imagine. Your code only needs to deal with a small number of input lines at a time. Let the system sort deal with the job of getting relevant records adjacent in the file.
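A rough sketch of that last step, assuming the intermediate files are comma-separated with the common field first and have already been combined and sorted (with something like: cat a.tmp b.tmp | sort -t, -k1,1 > sorted.txt); the file names, the separator and what "process" means are all placeholders:

    use strict;
    use warnings;

    open my $in, '<', 'sorted.txt' or die $!;

    my $current_key = '';
    my @group;              # adjacent records sharing the current key

    while( my $line = <$in> ) {
        chomp $line;
        my( $key ) = split /,/, $line, 2;
        if( $key ne $current_key ) {
            process_group( @group ) if @group;
            $current_key = $key;
            @group       = ();
        }
        push @group, $line;
    }
    process_group( @group ) if @group;   # don't forget the final group

    sub process_group {
        my @records = @_;
        # combine the records for one key however the application requires
        print join( "\t", @records ), "\n";
    }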
The approach you mentioned (putting the common field at the start, then concatenating the files and sorting) has worked out for my requirement. Thank you!!!
Re: Working on huge (GB sized) files
by choroba (Cardinal) on May 12, 2011 at 15:03 UTC
Can't you just use plain join?
Re: Working on huge (GB sized) files
by roboticus (Chancellor) on May 12, 2011 at 22:46 UTC