in reply to Unpacking small chucks of data quickly
Think of “memory” as “a disk file,” not as semiconductor chips on a circuit board, because in a modern computer system that is what it really is: virtual memory.
So, if you have "100 megabytes of anything at all," you do not want to build an in-memory hash of it for any reason whatsoever.
It appears to me that what you want to have, as the output of your program, is a structure which lists, for each "ID", all of the "VALs" for that ID. Therefore, let me suggest an alternate approach. (Or rather, second what has already been suggested.)
Sort the file with an on-disk sort, such as the sort command, first by ID and then by VAL. Once you have done that, you know that all of the records having any given ID value will be adjacent to one another.
So now you read the sorted file sequentially, remembering what the “previous” ID was, to see whether it is the same as or different from “this” one. If it is different, the end of one group has been reached and a new one has begun. If it is the same, you have another VAL to add to the current group. Finally, when the end of the file is reached, you are, by definition, at the end of the final group.
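For concreteness, here is a minimal Perl sketch of that control-break loop. It assumes the file has already been sorted on disk (for example with something like `sort -t, -k1,1 -k2,2 input.csv > sorted.csv`), and the file name `sorted.csv`, the comma delimiter, and the two-column ID,VAL layout are all assumptions for illustration; the `print` is just a stand-in for whatever you actually want to do with each finished group.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumes the input was already sorted on disk, first by ID, then by VAL.
# File name, delimiter, and column layout are illustrative assumptions.
open my $fh, '<', 'sorted.csv' or die "sorted.csv: $!";

my $prev_id;
my @vals;

while ( my $line = <$fh> ) {
    chomp $line;
    my ( $id, $val ) = split /,/, $line, 2;

    if ( defined $prev_id && $id ne $prev_id ) {
        # The ID changed: one group has ended, a new one begins.
        print "$prev_id: @vals\n";
        @vals = ();
    }
    push @vals, $val;      # Same ID: another VAL for the current group.
    $prev_id = $id;
}

# End of file is, by definition, the end of the final group.
print "$prev_id: @vals\n" if defined $prev_id;

close $fh;
```

Because only the current group is ever held in memory, the working set stays tiny no matter how large the input file is.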
Yes, this is literally what those folks were doing with those punched cards and magnetic tapes, all those years ago even before computers were invented.
A hundred megs? Oh, I'd be very surprised if it took even five seconds, after the sort is through. And the sort won't take long either.