est has asked for the wisdom of the Perl Monks concerning the following question:

Update 29/08/2008:
Using vec() as suggested by BrowserUk below works fine - but unfortunately not fast enough for matching a +500MB file-1 against a one-million-line file-2.
It took more than an hour of processing time *sigh*
In the end, sorting file-2 and then using File::SortedSeek worked much faster.
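Something along these lines (a simplified, untested sketch rather than my exact code; it assumes file-2 has been pre-sorted numerically, e.g. with sort -n into a file I call file-2.sorted here, and that numeric() leaves the filehandle positioned at or just after the nearest line - check the File::SortedSeek docs for the exact behaviour when an id is missing):

use strict;
use warnings;
use autodie qw(open close);
use File::SortedSeek;

open my $fip, '<', 'file-1';
open my $fop, '<', 'file-2.sorted';    # numerically sorted copy of file-2 (example name)

while (my $line = <$fip>) {
    my ($id) = split /-/, $line, 2;
    $id =~ s/\s+//g;                            # trim whitespace around the id

    File::SortedSeek::numeric( $fop, $id );     # binary-search seek in the sorted file
    my $hit = <$fop>;
    next unless defined $hit;
    chomp $hit;
    print $line if $hit == $id;                 # exact match found in file-2
}

close $fip;
close $fop;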


Hi fellow monks,

I'm trying to find the best solution to match data between two huge files. The content of the files looks like this:

file-1:            file-2
1000 - foo         2000
2000 - bar         3000
3000 - fubar       ....
...                990000
1000000 - heck

And the code should just print out every line in file-1 whose id matches an id in file-2.

An example of a solution would be:

A linear solution:
use autodie qw(open close);

open my $fip, '<', 'file-1';
open my $fop, '<', 'file-2';

LINE1:
while (my $line1 = <$fip>) {
    my @token = split(/-/, $line1, 2);
    seek $fop, 0, 0;                    # rescan file-2 from the top for every line of file-1
    while (my $line2 = <$fop>) {
        chomp $line2;
        if ($token[0] == $line2) {
            print $line1;
            next LINE1;
        }
    }
}

close $fip;
close $fop;

Which works fine, except that it is far too slow for comparing more than a million lines in both files, as it has to redo the inner loop over file-2 again and again.

Another solution would be to use a hash to store all the ids from file-2 and check whether $hash{$token[0]} exists, e.g.:

use autodie qw(open close);

open my $fip, '<', 'file-1';
open my $fop, '<', 'file-2';

my %record_id;
while (my $line = <$fop>) {
    chomp $line;
    $record_id{$line} = 1;
}
close $fop;

while (my $line = <$fip>) {
    my @token = split(/-/, $line, 2);
    if ( exists $record_id{$token[0]} ) {
        print $line;
    }
}
close $fip;

Which works really fast, but then I don't want to load a million lines into memory at once.

What would be the best solution/approach in that situation?
Any CPAN module available that I can use out of the box?

Thanks,
Eddy

Re: Matching data between huge files
by BrowserUk (Patriarch) on Aug 27, 2008 at 08:05 UTC
    Which works really fast, but then I don't want to load a million lines into memory at once.

    Why not?

    As the "keys" for matching are numeric, you can save some space by using a bitstring for your lookup:

    use autodie qw(open close);

    open my $fip, '<', 'file-1';
    open my $fop, '<', 'file-2';

    my $lookup = '';
    while (my $line = <$fop>) {
        chomp $line;
        vec( $lookup, $line, 1 ) = 1;
    }
    close $fop;

    while (my $line = <$fip>) {
        my @token = split(/-/, $line, 2);
        if ( vec( $lookup, $token[0], 1 ) ) {
            print $line;
        }
    }
    close $fip;

    For a range up to 1 million, that will use <1/4MB for the lookup (one bit per possible id, so 1,000,000 bits ≈ 122KB), rather than close to 100MB for a hash, and it should be at least as fast.

    Update: Actually it is ~10x faster than the hash lookup, though you probably won't notice the difference as you will be limited by the time taken to read the second file from disk. But in any case, it will be 2 or 3 orders of magnitude faster than any DB solution.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Matching data between huge files
by Corion (Patriarch) on Aug 27, 2008 at 07:45 UTC

    How big is really big? I see no problem with loading a million keys into memory. For other approaches, I've used a tied hash like SDBM_File to store the keys on disk. A possibly even faster approach is to sort both files by key, or have them retrieved in sorted fashion. Then you can do away with the hash altogether and use a variation of a merge sort, advancing through both files line by line, discarding/keeping lines that match.

    Personally, I would first try the "slurp one file into a hash" approach, because that's the simplest approach. Even if you go slightly over the amount of physical memory in the box, the resulting slowdown will likely not be worth any effort spent working around it. And it's always easy to modify such a program to use a tied hash on disk instead.
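    For instance, the hash version from the original post needs only a few extra lines to put %record_id on disk (untested sketch; the record_id.dbm filename is just an example):

    use strict;
    use warnings;
    use autodie qw(open close);
    use Fcntl;
    use SDBM_File;

    tie my %record_id, 'SDBM_File', 'record_id.dbm', O_RDWR | O_CREAT, 0666
        or die "Couldn't tie SDBM file: $!";

    open my $fop, '<', 'file-2';
    while (my $line = <$fop>) {
        chomp $line;
        $record_id{$line} = 1;                  # keys live on disk, not in RAM
    }
    close $fop;

    open my $fip, '<', 'file-1';
    while (my $line = <$fip>) {
        my @token = split /-/, $line, 2;
        print $line if exists $record_id{ $token[0] };
    }
    close $fip;

    untie %record_id;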

Re: Matching data between huge files
by tilly (Archbishop) on Aug 27, 2008 at 07:57 UTC
    If the second file is sorted, you could use the core module Search::Dict for this. It can locate a key in a million-line file in no more than about 20 disk reads (a binary search).
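    An untested sketch of that idea, assuming file-2 is sorted numerically and that your Search::Dict is recent enough to accept the hash-of-options form of look() with a comp routine (older versions only take the positional $dict/$fold arguments - see perldoc Search::Dict):

    use strict;
    use warnings;
    use autodie qw(open close);
    use Search::Dict;

    open my $fip, '<', 'file-1';
    open my $fop, '<', 'file-2';     # must be sorted numerically

    while (my $line = <$fip>) {
        my ($id) = split /-/, $line, 2;
        $id =~ s/\s+//g;

        # binary search: leaves $fop positioned at the first line >= $id
        look( $fop, $id, { comp => sub { no warnings 'numeric'; $_[0] <=> $_[1] } } );

        my $hit = <$fop>;
        next unless defined $hit;
        chomp $hit;
        print $line if $hit == $id;
    }

    close $fip;
    close $fop;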

    If you want a more efficient lookup, you can use a dbm file of some sort for this. People usually recommend BerkeleyDB (also accessible through DB_File); Perl itself comes with SDBM_File. The drawback is that you have to build the dbm file first. For very large files I recommend sorting your input data and using a btree implementation (that is an option with Berkeley DB).
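    For example, something like this untested sketch builds the btree once and reuses it on later runs (the ids.btree filename is just an example):

    use strict;
    use warnings;
    use autodie qw(open close);
    use Fcntl;
    use DB_File;

    # one-off build step: turn file-2 into a persistent btree index
    {
        tie my %ids, 'DB_File', 'ids.btree', O_RDWR | O_CREAT, 0666, $DB_BTREE
            or die "Cannot create ids.btree: $!";
        open my $fop, '<', 'file-2';
        while (my $id = <$fop>) {
            chomp $id;
            $ids{$id} = 1;
        }
        close $fop;
        untie %ids;
    }

    # later runs can skip the build and just reuse the ready-made index
    {
        tie my %ids, 'DB_File', 'ids.btree', O_RDONLY, 0666, $DB_BTREE
            or die "Cannot open ids.btree: $!";
        open my $fip, '<', 'file-1';
        while (my $line = <$fip>) {
            my @token = split /-/, $line, 2;
            print $line if exists $ids{ $token[0] };
        }
        close $fip;
        untie %ids;
    }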

    A third option that is worth considering is that you could use DBI and DBD::SQLite to create a real relational database. For this simple problem it is not worthwhile. But if your data structure is going to get more complex, or you might have several related tasks like this, then that is well worth considering.
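    An untested sketch of that route, kept deliberately simple (table and file names are only examples): load the ids from file-2 into a table once, then let SQL answer the lookups.

    use strict;
    use warnings;
    use autodie qw(open close);
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=ids.sqlite', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS ids (id INTEGER PRIMARY KEY)');

    # load the ids from file-2 once, inside one transaction for speed
    $dbh->begin_work;
    my $ins = $dbh->prepare('INSERT OR IGNORE INTO ids (id) VALUES (?)');
    open my $fop, '<', 'file-2';
    while (my $id = <$fop>) {
        chomp $id;
        $ins->execute($id);
    }
    close $fop;
    $dbh->commit;

    # look each file-1 id up via SQL
    my $chk = $dbh->prepare('SELECT 1 FROM ids WHERE id = ?');
    open my $fip, '<', 'file-1';
    while (my $line = <$fip>) {
        my ($id) = split /-/, $line, 2;
        $id =~ s/\s+//g;
        $chk->execute($id);
        print $line if $chk->fetchrow_array;
        $chk->finish;
    }
    close $fip;

    $dbh->disconnect;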

Re: Matching data between huge files
by JavaFan (Canon) on Aug 27, 2008 at 10:54 UTC
    It seems that both files are sorted. In that case, you can go through both files once - in parallel. No hashes needed, no slurping in entire files. For instance (untested!):
    use feature 'switch';    # given/when needs perl 5.10+

    open my $h_data, "<", "file-1" or die $!;
    open my $h_id,   "<", "file-2" or die $!;

    my $data = <$h_data>;
    my $id   = <$h_id>;

    while (defined $data && defined $id) {
        no warnings 'numeric';
        given ($data <=> $id) {
            when (-1) { $data = <$h_data> }
            when ( 0) { print $data; $data = <$h_data>; $id = <$h_id>; }
            when ( 1) { $id = <$h_id> }
        }
    }

      Indeed. Though this also assumes that the "record-ids" are unique in both files -- which they may well be, of course.

      For completeness one would recommend checking the validity of the "record-ids" as the two files are read, to ensure that they are (a) numeric, (b) unique and (c) in ascending order -- so that one can have confidence in the result.
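      A rough, untested sketch of such a check, as an illustrative helper one could call on each "record-id" as it is read (the check_id name and messages are made up for the example). Since the ids must be strictly ascending, a single "greater than the previous id" test covers both uniqueness and ordering without keeping every id in memory:

      use strict;
      use warnings;

      {
          my %last;    # previous "record-id" seen, per file

          sub check_id {
              my ( $id, $source ) = @_;
              die "$source: missing or non-numeric record-id\n"
                  unless defined $id and $id =~ /\A\d+\z/;
              die "$source: record-id $id not strictly ascending (previous was $last{$source})\n"
                  if exists $last{$source} and $id <= $last{$source};
              $last{$source} = $id;
              return $id;
          }
      }

      # e.g.:  my $next = <$h_id>;
      #        if ( defined $next ) { chomp $next; check_id( $next, 'file-2' ) }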

      As usual it's worth examining the problem before writing the code. For example:

      • how often is this going to be done ?

        If the "data" is to be searched in this way many times, then it may be worth pre-processing it into something that can be looked up rather than scanned -- especially if the number of "record-ids" in the "ids" file is relatively small. Or, create an auxiliary index of the "data".

      • how often does the "data" change ?

        Depending on how often the "data" changes, it may not be worth transforming it, or it may be necessary to be able to directly modify the pre-processed form after each change.

      • is the "ids" file also 'huge' ?

        If the "ids" file is relatively small, then reading that into a hash or a bitmap, and then scanning the "data" file looks straightforward; with the advantage of not depending on any ordering.

      • can the searches be batched ?

        If multiple "ids" files are to be processed, it might make sense to run them in parallel in a single scan of the 'huge' "data".

      Ah well. Coding first -- analysis afterwards [1]. How often do we do that ?

      [1] As the Queen of Hearts might have it.

        Update: both of the files are *not* sorted, record_id is *not* unique in file-1, and file-2 is equally big (or bigger) and will most likely change weekly.

        Having said that, I really like the solution given by BrowserUk, in the sense that my Benchmark gives a much faster result compared to my linear solution, and I don't need to build any DB.

        I haven't checked the memory usage with vec() though, but I don't think I need to, as BrowserUk has given an estimated comparison with hash slurping :-)

        Thanks.
Re: Matching data between huge files
by moritz (Cardinal) on Aug 27, 2008 at 07:51 UTC

    Update: Disregard, makes no sense. BrowserUk's solution is much better.

    If you want to do it multiple times, the best solution could be to load it into a relational database (think postgres, mysql) and do a join over the two tables.

      In my experience, this is only true if you need to partition both sides into three sets:

      • The elements in both sets
      • The elements only in the left set
      • The elements only in the right set

      You can formulate these as three SQL queries:

      SELECT left.keycol AS onlyleft
      FROM   left LEFT JOIN right ON left.keycol = right.keycol
      WHERE  right.keycol IS NULL

      SELECT right.keycol AS onlyright
      FROM   right LEFT JOIN left ON left.keycol = right.keycol
      WHERE  left.keycol IS NULL

      SELECT right.keycol AS inboth
      FROM   right INNER JOIN left ON left.keycol = right.keycol

      But you can get the same results by using a hash and "colliding out" the elements from the hash and then looking what's left over. The SQL approach does not handle duplicate keys well, while the hash-based approach is extended trivially to handle duplicate keys by storing arrayrefs of the rows instead of just using the keys as indicators.
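      For example (untested sketch; get_key() stands in for whatever extracts the key column from a row):

      use strict;
      use warnings;

      sub partition_by_key {
          my ( $left_rows, $right_rows, $get_key ) = @_;

          # duplicate keys are fine: each key maps to an arrayref of its rows
          my ( %left, %right );
          push @{ $left{  $get_key->($_) } }, $_ for @$left_rows;
          push @{ $right{ $get_key->($_) } }, $_ for @$right_rows;

          my %both;
          for my $key ( keys %right ) {
              next unless exists $left{$key};
              # collide out: the key leaves both "only" sides and lands in %both
              $both{$key} = { left  => delete $left{$key},
                              right => delete $right{$key} };
          }

          # whatever is left over was never matched
          return ( \%left, \%both, \%right );    # only-left, both, only-right
      }

      Everything here still lives in memory, of course; for really big inputs the left-hand hash could be tied to a dbm file as discussed elsewhere in the thread.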

      I have to confess, I've done lots and lots of such comparison programs. I haven't collected together the code into a released framework yet, because there are different business rules for when two rows are to be considered equal/identical, and I haven't found a convenient way to hide that in a framework.