in reply to Re: Re: Re: many to many join on text files
in thread many to many join on text files

Egad, man that's awful! To put it in database terminology, you're doing a double-nested-loops-outer-join... over a full-table-scan! To quote the Simpsons: "That's bad".

What you should be doing for a large equi-join like this is a method called "merge-join". The concept is: sort both files first, on the columns you wish to join by, then open up each file, and advance together between the two files. Think of a zipper.

Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):

# assuming this is *nix, or something with a sort utility, otherwise t +his can be # done directly in perl system("sort holds > tmpholds") and die; system("sort copies > tmpcopies") and die; open(HOLDS,"<tmpholds") or die; my (@holds, $holdseof); sub readhold { ($_=<HOLDS>) || $holdseof++; chomp; @holds = split(/\|/ +,$_,-1); } readhold; open(COPIES,"<tmpcopies") or die; my (@copies, $copieseof); sub readcopy { ($_=<COPIES>) || $copieseof++; chomp; @copies = split(/ +\|/,$_,-1); } readcopy; while(!($holdseof && $copieseof)) { if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) { print "copy ($copies[0])\n"; readcopy; } elsif ($copieseof || $copies[0] gt $holds[0]) { print "hold ($holds[0])\n"; readhold; } else { print "hold and copy ($holds[0])\n"; readhold; readcopy; } } close HOLDS; close COPIES; __END__ holds ------ iiiii asdf fdd dsafe dsaf bfer rewtewt bfret zzzzzzzzz copies ------ weewr dddddd rewtewt bfret fdfdsfsdfdsa dsafe dsaf asdf fdd bfer output ------ hold and copy (asdf) hold and copy (bfer) hold and copy (bfret) copy (dddddd) hold and copy (dsaf) hold and copy (dsafe) hold and copy (fdd) copy (fdfdsfsdfdsa) hold (iiiii) hold and copy (rewtewt) copy (weewr) hold (zzzzzzzzz)
------------ :Wq Not an editor command: Wq
FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).

Replies are listed 'Best First'.
Re: Re: Re: Re: Re: many to many join on text files
by tilly (Archbishop) on Apr 15, 2004 at 06:39 UTC
    As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset has a many to many joins, and where each side is many, he wants every combination to be represented. Which can be done with walking in parallel, but it requires some backtracking logic that can be tricky to get right.

    The BTREE solution that I gave is very similar to pre-sorting both and walking in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.