in reply to Matching data between huge files

It seems that both files are sorted. In that case, you can go through both files once - in parallel. No hashes needed, no slurping in entire files. For instance (untested!):
use feature 'switch';    # given/when requires Perl 5.10+

open my $h_data, "<", "file-1" or die $!;
open my $h_id,   "<", "file-2" or die $!;

my $data = <$h_data>;
my $id   = <$h_id>;

while (defined $data && defined $id) {
    no warnings 'numeric';
    given ($data <=> $id) {
        when (-1) { $data = <$h_data>; }
        when ( 0) { print $data; $data = <$h_data>; $id = <$h_id>; }
        when ( 1) { $id = <$h_id>; }
    }
}

Replies are listed 'Best First'.
Re^2: Matching data between huge files
by gone2015 (Deacon) on Aug 27, 2008 at 13:18 UTC

    Indeed. Though this also assumes that the "record-ids" are unique in both files -- which they may well be, of course.

    For completeness one would recommend checking the validity of the "record-ids" as the two files are read, to ensure that they are (a) numeric, (b) unique and (c) in ascending order -- so that one can have confidence in the result.
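
    A minimal, untested sketch of such a check (assuming one leading numeric record-id per line; the file name is just an example):

        use strict;
        use warnings;

        open my $fh, "<", "file-1" or die $!;

        my $prev;
        while (my $line = <$fh>) {
            # (a) numeric: require a leading run of digits
            my ($id) = $line =~ /^(\d+)/
                or die "line $.: record-id is not numeric\n";
            if (defined $prev) {
                # (b) unique and (c) ascending
                die "line $.: record-id $id repeated\n"     if $id == $prev;
                die "line $.: record-id $id out of order\n" if $id <  $prev;
            }
            $prev = $id;
        }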

    As usual it's worth examining the problem before writing the code. For example:

    • how often is this going to be done ?

      If the "data" is to be searched in this way many times, then it may be worth pre-processing it into something that can be looked-up rather than scanned -- especially if the number of "record-ids" in the "ids" file is relatively small. Or, create an auxiliary index of the "data".

    • how often does the "data" change ?

      Depending on how often the "data" changes, it may not be worth transforming it, or it may be necessary to be able to directly modify the pre-processed form after each change.

    • is the "ids" file also 'huge' ?

      If the "ids" file is relatively small, then reading that into a hash or a bitmap, and then scanning the "data" file looks straightforward; with the advantage of not depending on any ordering.

    • can the searches be batched ?

      If multiple "ids" files are to be processed, it might make sense to run them in parallel in a single scan of the 'huge' "data".

    Ah well. Coding first -- analysis afterwards [1]. How often do we do that ?

    [1] As the Queen of Hearts might have it.

      Update: both of the files are *not* sorted, record_id is *not* unique in file-1, and file-2 is equally big (or bigger) and is most likely going to change weekly.

      Having said that, I really like the solution given by BrowserUk, in the sense that my Benchmark shows a much faster result compared to my linear solution, and I don't need to build any DB.

      I haven't checked the memory usage of "vec()", though; but I don't think I need to, as BrowserUk has given an estimated comparison with slurping into a hash :-)

      Thanks.