in reply to Re: Re: Re: many to many join on text files
in thread many to many join on text files
What you should be doing for a large equi-join like this is a method called "merge-join". The concept is: sort both files first, on the columns you wish to join by, then open up each file, and advance together between the two files. Think of a zipper.
Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):
# assuming this is *nix, or something with a sort utility, otherwise t +his can be # done directly in perl system("sort holds > tmpholds") and die; system("sort copies > tmpcopies") and die; open(HOLDS,"<tmpholds") or die; my (@holds, $holdseof); sub readhold { ($_=<HOLDS>) || $holdseof++; chomp; @holds = split(/\|/ +,$_,-1); } readhold; open(COPIES,"<tmpcopies") or die; my (@copies, $copieseof); sub readcopy { ($_=<COPIES>) || $copieseof++; chomp; @copies = split(/ +\|/,$_,-1); } readcopy; while(!($holdseof && $copieseof)) { if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) { print "copy ($copies[0])\n"; readcopy; } elsif ($copieseof || $copies[0] gt $holds[0]) { print "hold ($holds[0])\n"; readhold; } else { print "hold and copy ($holds[0])\n"; readhold; readcopy; } } close HOLDS; close COPIES; __END__ holds ------ iiiii asdf fdd dsafe dsaf bfer rewtewt bfret zzzzzzzzz copies ------ weewr dddddd rewtewt bfret fdfdsfsdfdsa dsafe dsaf asdf fdd bfer output ------ hold and copy (asdf) hold and copy (bfer) hold and copy (bfret) copy (dddddd) hold and copy (dsaf) hold and copy (dsafe) hold and copy (fdd) copy (fdfdsfsdfdsa) hold (iiiii) hold and copy (rewtewt) copy (weewr) hold (zzzzzzzzz)
FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).------------ :Wq Not an editor command: Wq
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Re: Re: Re: Re: many to many join on text files
by tilly (Archbishop) on Apr 15, 2004 at 06:39 UTC |