in reply to Matching data between huge files

It seems that both files are sorted. In that case, you can go through both files once - in parallel. No hashes needed, no slurping in entire files. For instance (untested!):
use feature 'switch';    # given/when requires Perl 5.10+

open my $h_data, "<", "file-1" or die $!;
open my $h_id,   "<", "file-2" or die $!;

my $data = <$h_data>;
my $id   = <$h_id>;

while (defined $data && defined $id) {
    no warnings 'numeric';
    given ($data <=> $id) {
        when (-1) { $data = <$h_data>; }
        when ( 0) { print $data; $data = <$h_data>; $id = <$h_id>; }
        when ( 1) { $id = <$h_id>; }
    }
}

Replies are listed 'Best First'.
Re^2: Matching data between huge files
by gone2015 (Deacon) on Aug 27, 2008 at 13:18 UTC

    Indeed. Though this also assumes that the "record-ids" are unique in both files -- which they may well be, of course.

    For completeness one would recommend checking the validity of the "record-ids" as the two files are read, to ensure that they are (a) numeric, (b) unique and (c) in ascending order -- so that one can have confidence in the result.
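
    A minimal, untested sketch of such a check (assuming one leading numeric record-id per line; the file name is just an example):

        use strict;
        use warnings;

        open my $fh, "<", "file-1" or die $!;

        my $prev;
        while (my $line = <$fh>) {
            # (a) numeric: require a leading run of digits
            my ($id) = $line =~ /^(\d+)/
                or die "line $.: record-id is not numeric\n";
            if (defined $prev) {
                # (b) unique and (c) ascending
                die "line $.: record-id $id repeated\n"     if $id == $prev;
                die "line $.: record-id $id out of order\n" if $id <  $prev;
            }
            $prev = $id;
        }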

    As usual it's worth examining the problem before writing the code. For example:

    • how often is this going to be done ?

      If the "data" is to be searched in this way many times, then it may be worth pre-processing it into something that can be looked-up rather than scanned -- especially if the number of "record-ids" in the "ids" file is relatively small. Or, create an auxiliary index of the "data".

    • how often does the "data" change ?

      Depending on how often the "data" changes, it may not be worth transforming it, or it may be necessary to be able to directly modify the pre-processed form after each change.

    • is the "ids" file also 'huge' ?

      If the "ids" file is relatively small, then reading that into a hash or a bitmap, and then scanning the "data" file looks straightforward; with the advantage of not depending on any ordering.

    • can the searches be batched ?

      If multiple "ids" files are to be processed, it might make sense to run them in parallel in a single scan of the 'huge' "data".

    Ah well. Coding first -- analysis afterwards [1]. How often do we do that ?

    [1] As the Queen of Hearts might have it.

      Update: both of the files are *not* sorted, record_id is *not* unique in file-1, and file-2 is equally big (or bigger) and is most likely going to change weekly.

      Having said that, I really like the solution given by BrowserUk, in the sense that my Benchmark shows a much faster result compared to my linear solution, and I don't need to build any DB.

      I haven't checked the memory usage of "vec()", though; but I don't think I need to, as BrowserUk has given an estimated comparison with slurping into a hash :-)

      Thanks.