Re: Re: Re: Re: many to many join on text files

Egad, man that's awful! To put it in database terminology, you're doing a double-nested-loops-outer-join... over a full-table-scan! To quote the Simpsons: "That's bad".

What you should be doing for a large equi-join like this is a method called "merge-join". The concept is: sort both files first, on the columns you wish to join by, then open up each file, and advance together between the two files. Think of a zipper.

Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):

# assuming this is *nix, or something with a sort utility, otherwise t
+his can be
# done directly in perl
system("sort holds > tmpholds") and die;
system("sort copies > tmpcopies") and die;

open(HOLDS,"<tmpholds") or die;
my (@holds, $holdseof);
sub readhold { ($_=<HOLDS>) || $holdseof++; chomp; @holds = split(/\|/
+,$_,-1); }
readhold;

open(COPIES,"<tmpcopies") or die;
my (@copies, $copieseof);
sub readcopy { ($_=<COPIES>) || $copieseof++; chomp; @copies = split(/
+\|/,$_,-1); }
readcopy;

while(!($holdseof && $copieseof)) {
    if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
        print "copy ($copies[0])\n";
        readcopy;
    }
    elsif ($copieseof || $copies[0] gt $holds[0]) {
        print "hold ($holds[0])\n";
        readhold;
    }
    else {
        print "hold and copy ($holds[0])\n";
        readhold;
        readcopy;
    }
}
close HOLDS;
close COPIES;

__END__

holds
------
iiiii
asdf
fdd
dsafe
dsaf
bfer
rewtewt
bfret
zzzzzzzzz

copies
------
weewr
dddddd
rewtewt
bfret
fdfdsfsdfdsa
dsafe
dsaf
asdf
fdd
bfer

output
------
hold and copy (asdf)
hold and copy (bfer)
hold and copy (bfret)
copy (dddddd)
hold and copy (dsaf)
hold and copy (dsafe)
hold and copy (fdd)
copy (fdfdsfsdfdsa)
hold (iiiii)
hold and copy (rewtewt)
copy (weewr)
hold (zzzzzzzzz)
[download]

------------
:Wq
Not an editor command: Wq
[download]

FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).

Comment on Re: Re: Re: Re: many to many join on text files Select or Download Code

Replies are listed 'Best First'.
Re: Re: Re: Re: Re: many to many join on text files by tilly (Archbishop) on Apr 15, 2004 at 06:39 UTC
As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset has a many to many joins, and where each side is many, he wants every combination to be represented. Which can be done with walking in parallel, but it requires some backtracking logic that can be tricky to get right. The BTREE solution that I gave is very similar to pre-sorting both and walking in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.	[reply]

Replies are listed 'Best First'.

Re: Re: Re: Re: Re: many to many join on text files
by tilly (Archbishop) on Apr 15, 2004 at 06:39 UTC

Re: Re: Re: Re: Re: Re: many to many join on text files

The BTREE solution that I gave is very similar to pre-sorting both and walking in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.

[reply]