Re^2: Matching data between huge files

Indeed. Though this also assumes that the "record-ids" are unique in both files -- which they may well be, of course.

For completeness one would recommend checking the validity of the "record-ids" as the two files are read, to ensure that they are (a) numeric, (b) unique and (c) in ascending order -- so that one can have confidence in the result.
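A minimal sketch of such a check, assuming one record per line with the "record-id" as the first whitespace-delimited field (the thread does not state the file layout). The name `check_ids` is hypothetical. If the ids really are in ascending order, any duplicates must be adjacent, so comparing each id with the previous one is enough:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in the record-ids
# read from $fh. ASSUMPTION: the id is the first whitespace-delimited field.
sub check_ids {
    my ($fh) = @_;
    my @problems;
    my $prev;
    while ( my $line = <$fh> ) {
        my ($id) = split ' ', $line;
        next unless defined $id;    # skip blank lines
        if ( $id !~ /^\d+$/ ) {
            push @problems, "line $.: id '$id' is not numeric";
            next;
        }
        if ( defined $prev ) {
            push @problems, "line $.: duplicate id $id"    if $id == $prev;
            push @problems, "line $.: id $id out of order" if $id <  $prev;
        }
        $prev = $id;
    }
    return @problems;
}
```

In practice the same tests could be folded into the loop that does the matching, so the files are still read only once.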

As usual it's worth examining the problem before writing the code. For example:

how often is this going to be done ?
If the "data" is to be searched in this way many times, then it may be worth pre-processing it into something that can be looked-up rather than scanned -- especially if the number of "record-ids" in the "ids" file is relatively small. Or, create an auxiliary index of the "data".
how often does the "data" change ?
Depending on how often the "data" changes, it may not be worth transforming it, or it may be necessary to be able to directly modify the pre-processed form after each change.
is the "ids" file also 'huge' ?
If the "ids" file is relatively small, then reading that into a hash or a bitmap, and then scanning the "data" file looks straightforward; with the advantage of not depending on any ordering.
can the searches be batched ?
If multiple "ids" files are to be processed, it might make sense to run them in parallel in a single scan of the 'huge' "data".
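The last two points can be sketched together: read each (small) "ids" file into a hash, then make a single pass over the "data". This is only an illustration under assumptions: the name `scan_once` and the batch labels are hypothetical, and the "record-id" is taken to be the first whitespace-delimited field of each line. Nothing here depends on either file being sorted.

```perl
use strict;
use warnings;

# Hypothetical helper: %$ids_by_batch maps a batch label to an array ref of
# wanted record-ids. Returns a hash ref: label => array ref of matching lines.
# ASSUMPTION: the id is the first whitespace-delimited field of each line.
sub scan_once {
    my ( $data_fh, $ids_by_batch ) = @_;

    # Invert to id => [labels] so each data line needs only one hash lookup.
    my %wanted;
    for my $label ( sort keys %$ids_by_batch ) {
        my %seen;
        for my $id ( @{ $ids_by_batch->{$label} } ) {
            next if $seen{$id}++;    # ignore repeated ids within a batch
            push @{ $wanted{$id} }, $label;
        }
    }

    my %found = map { $_ => [] } keys %$ids_by_batch;
    while ( my $line = <$data_fh> ) {
        my ($id) = split ' ', $line;
        next unless defined $id and exists $wanted{$id};
        push @{ $found{$_} }, $line for @{ $wanted{$id} };
    }
    return \%found;
}
```

With a single "ids" file this reduces to the plain "hash, then scan" approach. With several, the cost of reading the 'huge' "data" is paid once rather than once per batch.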

Ah well. Coding first -- analysis afterwards [1]. How often do we do that ?

[1] As the Queen of Hearts might have it.


Replies are listed 'Best First'.
Re^3: Matching data between huge files by est (Acolyte) on Aug 28, 2008 at 00:54 UTC
Update: neither file is sorted, record_id is not unique in file-1, and file-2 is equally big (or bigger) and will most likely change weekly. Having said that, I really like the solution given by BrowserUk: my Benchmark shows it is much faster than my linear solution, and I don't need to build any DB. I haven't checked the memory usage with "vec()", but I don't think I need to, as BrowserUk has given an estimated comparison with slurping into a hash :-) Thanks.
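For readers unfamiliar with the bitmap idea, here is a minimal sketch using Perl's built-in vec(). It is not BrowserUk's code, which is not reproduced in this part of the thread. The names `build_bitmap` and `grep_by_bitmap` are hypothetical, and the sketch assumes the ids are non-negative integers found at the start of each line. Unsorted input and repeated ids are both fine, since only membership is recorded.

```perl
use strict;
use warnings;

# Build a bit string with bit N set for every wanted id N.
# ASSUMPTION: ids are non-negative integers; the string grows to roughly
# (largest id / 8) bytes, however many ids are actually set.
sub build_bitmap {
    my (@ids) = @_;
    my $bitmap = '';
    vec( $bitmap, $_, 1 ) = 1 for @ids;
    return $bitmap;
}

# Return the lines of $fh whose leading integer is an id set in the bitmap.
sub grep_by_bitmap {
    my ( $fh, $bitmap ) = @_;
    my @hits;
    while ( my $line = <$fh> ) {
        my ($id) = $line =~ /^\s*(\d+)/ or next;    # skip lines with no id
        push @hits, $line if vec( $bitmap, $id, 1 );
    }
    return @hits;
}
```

Whether this beats a hash on memory depends on how large the biggest id is relative to the number of ids, so it is worth measuring on the real files.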
