in reply to Re: Matching data between huge files
in thread Matching data between huge files
Indeed. Though this also assumes that the "record-ids" are unique in both files -- which they may well be, of course.
For completeness, it is worth validating the "record-ids" as the two files are read, to ensure that they are (a) numeric, (b) unique and (c) in ascending order -- so that one can have confidence in the result.
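A minimal sketch of that check, in Python purely for illustration. The record layout is an assumption on my part (each line starts with the record-id, separated from the rest by whitespace), and check_record_ids is a hypothetical helper, not anything from the thread. Requiring each id to be strictly greater than the previous one covers (b) and (c) in a single comparison.

```python
def check_record_ids(lines):
    """Yield (record_id, line) pairs, raising ValueError on the first bad id.

    Assumed layout (hypothetical): the record-id is the first
    whitespace-separated field of every line.
    """
    previous = None
    for line_no, line in enumerate(lines, start=1):
        fields = line.split(None, 1)
        # (a) numeric: plain ASCII digits only
        if not fields or not (fields[0].isascii() and fields[0].isdigit()):
            raise ValueError(f"line {line_no}: record-id is missing or not numeric")
        record_id = int(fields[0])
        # (b) unique and (c) ascending: strictly greater than the previous id
        if previous is not None and record_id <= previous:
            kind = "duplicate" if record_id == previous else "out of order"
            raise ValueError(f"line {line_no}: record-id {record_id} is {kind}")
        previous = record_id
        yield record_id, line
```

Because it is a generator, it can wrap an open file handle and be consumed by whatever does the matching, so the check costs no extra pass over either file.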
As usual it's worth examining the problem before writing the code. For example:
If the "data" is to be searched in this way many times, then it may be worth pre-processing it into something that can be looked up rather than scanned -- especially if the number of "record-ids" in the "ids" file is relatively small. Alternatively, create an auxiliary index of the "data".
Depending on how often the "data" changes, it may not be worth transforming it, or it may be necessary to be able to directly modify the pre-processed form after each change.
If the "ids" file is relatively small, then reading it into a hash or a bitmap and then scanning the "data" file looks straightforward, with the advantage of not depending on any ordering.
If multiple "ids" files are to be processed, it might make sense to run them in parallel in a single scan of the 'huge' "data".
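The last two points can be sketched together, again in Python for illustration only. The assumptions are mine: one id per line in each "ids" file, the record-id as the first whitespace-separated field of each "data" line, and ids compared as text (so "007" and "7" would not match). load_ids and scan_once are hypothetical names.

```python
def load_ids(lines):
    """Read one "ids" file into a set, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def scan_once(data_lines, id_sets):
    """Scan the "data" once, collecting matches for several id sets.

    id_sets maps a label (for example the name of an "ids" file) to the
    set returned by load_ids. Returns a dict mapping each label to the
    list of matching data lines, in the order they were read.
    """
    matches = {label: [] for label in id_sets}
    for line in data_lines:
        fields = line.split(None, 1)
        if not fields:          # blank line in the data: nothing to match
            continue
        key = fields[0]
        for label, ids in id_sets.items():
            if key in ids:      # hash lookup, independent of file ordering
                matches[label].append(line)
    return matches
```

Usage might look like scan_once(open("data.txt"), {"a": load_ids(open("ids_a.txt")), "b": load_ids(open("ids_b.txt"))}), where the file names are placeholders. For a really large result one would write matches out as they are found rather than collect them in lists.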
Ah well. Coding first -- analysis afterwards [1]. How often do we do that?
[1] As the Queen of Hearts might have it.
Re^3: Matching data between huge files
by est (Acolyte) on Aug 28, 2008 at 00:54 UTC