Re: Re: Re: many to many join on text files
by aquarium (Curate) on Apr 15, 2004 at 00:37 UTC
finished writing the full join in perl (after which I still have to do a sort and uniq to strip duplicates)... it's been running for 10 minutes now and still hasn't finished processing the first file. Here's the code:
open(HOLDS,"<holds") or die;
while($hold=<HOLDS>) {
    chomp $hold;
    @holds = split(/\|/,$hold,-1);
    $lookup = $holds[0];
    open(COPIES,"<copies") or die;
    undef $matched;
    while($copy=<COPIES>) {
        chomp $copy;
        @copies = split(/\|/,$copy,-1);
        $matchfield = $copies[0];
        if($lookup eq $matchfield) {
            $matched = 1;
            print "hold and copy\n";
        }
    }
    if(!$matched) {
        print "hold\n";
    }
    close COPIES;
}
close HOLDS;

open(COPIES,"<copies") or die;
while($copy=<COPIES>) {
    chomp $copy;
    @copies = split(/\|/,$copy,-1);
    $lookup = $copies[0];
    open(HOLDS,"<holds") or die;
    undef $matched;
    while($hold=<HOLDS>) {
        chomp $hold;
        @holds = split(/\|/,$hold,-1);
        $matchfield = $holds[0];
        if($lookup eq $matchfield) {
            $matched = 1;
            print "copy and hold\n";
        }
    }
    if(!$matched) {
        print "copy\n";
    }
    close HOLDS;
}
close COPIES;
Here is a more scalable solution that does what yours does and should handle large amounts of data (using some disk). If your data fits nicely in memory, you may get away with replacing the file names with undef, so the dbm is held in RAM (a short sketch of that variant follows the code).
#! /usr/bin/perl -w
use strict;
use DB_File;
use vars qw($DB_BTREE); # This was exported by DB_File
# Allow the btree's to have multiple entries per key
$DB_BTREE->{flags} = R_DUP;
# DB_File wants you to create these with a tie, so I will even though
# I'm ignoring the tied hash.
unlink("holds.dbm"); # In case it is there
my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot create btree 'holds.dbm': $!";
unlink("copies.dbm"); # In case it is there
my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
O_RDWR|O_CREAT, 0666, $DB_BTREE
or die "Cannot create btree 'copies.dbm': $!";
open(COPIES, "<copies") or die "Can't open 'copies': $!";
while (<COPIES>) {
    chomp;
    (my $lookup) = split /\|/, $_;
    $btree_copies->put($lookup, $_);
}
open(HOLDS, "<holds") or die "Can't open 'holds': $!";
while (<HOLDS>) {
    chomp(my $value = $_);
    (my $lookup) = split /\|/, $value;
    $btree_holds->put($lookup, $value);
    # get_dup in scalar context returns how many records share $lookup;
    # in list context it returns all of them.
    if ($btree_copies->get_dup($lookup)) {
        foreach my $other_value ($btree_copies->get_dup($lookup)) {
            print "hold and copy for $lookup\n";
        }
    }
    else {
        print "hold for $lookup\n";
    }
}
# Walk copies using the tree. Note that the API is somewhat obscure...
for (
    my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
    0 == $status;
    $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
    if ($btree_holds->get_dup($lookup)) {
        foreach my $other_value ($btree_holds->get_dup($lookup)) {
            print "copy and hold for $lookup\n";
        }
    }
    else {
        print "copy for $lookup\n";
    }
}
sub END {
    $btree_holds = $btree_copies = undef;
    untie %hash_holds;
    untie %hash_copies;
    unlink($_) for 'holds.dbm', 'copies.dbm';
}
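If the data really does fit in RAM, the in-memory variant hinted at above would look something like this (a rough sketch, untested here; same flags and $DB_BTREE as above - DB_File documents undef in place of a file name as the way to ask for an in-memory database):
# In-memory variant: undef instead of a file name keeps the btree in RAM.
my $btree_mem = tie my %hash_mem, 'DB_File', undef,
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory btree: $!";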
Note that for any real use, you probably don't want to both do "hold and copy" and "copy and hold" since they are synonymous.
Egad, man that's awful! To put it in database terminology, you're doing a double-nested-loops-outer-join... over a full-table-scan! To quote the Simpsons: "That's bad".
What you should be doing for a large equi-join like this is a method called "merge-join". The concept: sort both files first on the columns you wish to join by, then open each file and advance through the two of them in step. Think of a zipper.
Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):
# assuming this is *nix, or something with a sort utility, otherwise this
# can be done directly in perl
system("sort holds > tmpholds") and die;
system("sort copies > tmpcopies") and die;

open(HOLDS,"<tmpholds") or die;
my (@holds, $holdseof);
sub readhold { ($_=<HOLDS>) || $holdseof++; chomp; @holds = split(/\|/,$_,-1); }
readhold;

open(COPIES,"<tmpcopies") or die;
my (@copies, $copieseof);
sub readcopy { ($_=<COPIES>) || $copieseof++; chomp; @copies = split(/\|/,$_,-1); }
readcopy;

while(!($holdseof && $copieseof)) {
    if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
        print "copy ($copies[0])\n";
        readcopy;
    }
    elsif ($copieseof || $copies[0] gt $holds[0]) {
        print "hold ($holds[0])\n";
        readhold;
    }
    else {
        print "hold and copy ($holds[0])\n";
        readhold;
        readcopy;
    }
}
close HOLDS;
close COPIES;
__END__
holds
------
iiiii
asdf
fdd
dsafe
dsaf
bfer
rewtewt
bfret
zzzzzzzzz
copies
------
weewr
dddddd
rewtewt
bfret
fdfdsfsdfdsa
dsafe
dsaf
asdf
fdd
bfer
output
------
hold and copy (asdf)
hold and copy (bfer)
hold and copy (bfret)
copy (dddddd)
hold and copy (dsaf)
hold and copy (dsafe)
hold and copy (fdd)
copy (fdfdsfsdfdsa)
hold (iiiii)
hold and copy (rewtewt)
copy (weewr)
hold (zzzzzzzzz)
------------
:Wq
Not an editor command: Wq
FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).
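For what it's worth, a rough sketch of such a hash join in Perl (my own illustration, not code from the posts above; it assumes the 'holds'/'copies' files and '|'-delimited layout used in this thread, and only reports things from the hold side):

use strict;
use warnings;

# Build the hash side from the (hopefully smaller) copies file.
my %copies;
open my $c, '<', 'copies' or die "copies: $!";
while (<$c>) {
    chomp;
    my ($key) = split /\|/, $_, -1;
    push @{ $copies{$key} }, $_;        # keep every row sharing this key
}
close $c;

# Probe it with the holds file.
open my $h, '<', 'holds' or die "holds: $!";
while (<$h>) {
    chomp;
    my ($key) = split /\|/, $_, -1;
    if (my $rows = $copies{$key}) {
        print "hold and copy ($key)\n" for @$rows;   # one line per pairing
    }
    else {
        print "hold ($key)\n";
    }
}
close $h;

Copies with no matching hold would need a second pass (or a "seen" mark on the %copies entries), which is where the merge-join above stays simpler.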
As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset has many to many joins, and where each side is many he wants every combination to be represented. That can be done by walking the two sorted files in parallel, but it requires some backtracking logic that can be tricky to get right.
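For that many-to-many step, the part that needs the backtracking is just a cross product of the two groups of rows that share the current key - roughly (group names invented for illustration; filling the groups from the two sorted streams is the fiddly bit):

# @hold_group and @copy_group are assumed to already hold every row from
# each sorted file that shares the current key.
for my $hold_row (@hold_group) {
    for my $copy_row (@copy_group) {
        print "hold and copy: $hold_row | $copy_row\n";
    }
}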
The BTREE solution that I gave is very similar to pre-sorting both and walking in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.
Hmm, does this represent everything the program needs to do? Because if so, I note that you only ever use the first field of each line - you can load the first field of all the lines in both files into memory, and avoid the painfully slow re-reading; something like (untested):
my $hold = readfile('holds');
my $copy = readfile('copies');
for my $key (keys %$hold) {
    if ($copy->{$key}) {
        print "$key: hold and copy (or copy and hold)\n";
    }
}
sub readfile {
    my $file = shift;
    my $hash = {};
    open(my $fh, "<$file") or die "$file: $!";
    local $_;
    while (<$fh>) {
        # fields are '|' delimited - pick up the first field
        my $key = substr $_, 0, index($_, "|");
        ++$hash->{$key};
    }
    close $fh;
    return $hash;
}
Even if this is only the starting point, and the real code needs to access all the fields, you could for example cache in memory the first field and the offset into the file for each row, and then use seek() to locate the complete record whenever you need it.
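A rough sketch of that offset-cache idea (untested; index_file and 'somekey' are made-up names for illustration):

sub index_file {
    my $file = shift;
    my %offset;
    open(my $fh, "<", $file) or die "$file: $!";
    while (1) {
        my $pos = tell $fh;                  # offset of the record about to be read
        defined(my $line = <$fh>) or last;
        my $key = substr $line, 0, index($line, "|");
        push @{ $offset{$key} }, $pos;       # several rows may share a key
    }
    return ($fh, \%offset);
}

my ($fh, $offsets) = index_file('holds');
for my $pos (@{ $offsets->{'somekey'} || [] }) {
    seek $fh, $pos, 0;                       # 0 == SEEK_SET
    my $record = <$fh>;                      # the complete original line
    print $record;
}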
Hugo
Something like the code below reduces it to one pass. It assumes that the two files are both pre-sorted on the key field.

The idea is to maintain a buffer containing a window of all the adjacent lines in the second file that have the same current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes). Then read the first file one line at a time and get its key. If the key is less than the current key for the buffer, print the line; if it's greater, print the accumulated lines from the second file and refill the buffer. If they're the same, print the current line from file 1 with each of the lines in the buffer. The code below doesn't actually work yet; it needs more work to ensure that the buffer is flushed at the right times, and it doesn't handle EOFs correctly. But I'm supposed to be working rather than messing about on perlMonks...
#!/usr/bin/perl -w
use strict;
open my $f1, '<', 'a' or die "a: $!";   # first (driving) file
open my $f2, '<', 'b' or die "b: $!";   # second file, read in key-sized chunks
my ($key2, @rest2, $nkey2, $nrest2);
# read in next N lines from f2 that have the same key
sub get_next_block {
@rest2 = ();
while (1) {
if (defined $nkey2) {
push @rest2, $nrest2;
$key2 = $nkey2;
}
my $line2 = <$f2>;
return 0 unless defined $line2;
($nkey2, $nrest2) = split / /, $line2;
chomp $nrest2;
last if defined $key2 && $nkey2 ne $key2;
}
}
get_next_block();
OUTER:
while (defined (my $line1 = <$f1>)) {
my ($key1, $rest1) = split / /, $line1;
chomp $rest1;
if ($key1 gt $key2) {
print "$key2 $_\n" for @rest2;
get_next_block();
next;
}
if ($key1 lt $key2) {
print $line1;
next;
}
print "$key1 $rest1 $_\n" for @rest2;
}
print while (<$f1>);
print while (<$f2>);
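For what it's worth, here is a minimal sketch of the same buffered-window approach with the EOF and flush handling filled in (my own rough completion, untested against real data; it keeps the assumptions above: files 'a' and 'b', pre-sorted, with a space-separated key as the first field):

#!/usr/bin/perl
use strict;
use warnings;

open my $f1, '<', 'a' or die "a: $!";
open my $f2, '<', 'b' or die "b: $!";

# One pending, already-chomped line from f2, shared with next_block().
my $pending = <$f2>;
chomp $pending if defined $pending;

# Return the next run of f2 lines that share one key, plus that key.
sub next_block {
    return unless defined $pending;
    my ($key) = split / /, $pending;
    my @block = ($pending);
    while (defined($pending = <$f2>)) {
        chomp $pending;
        my ($k) = split / /, $pending;
        last if $k ne $key;
        push @block, $pending;
    }
    return ($key, \@block);
}

my ($key2, $block) = next_block();
my $matched = 0;

while (defined(my $line1 = <$f1>)) {
    chomp $line1;
    my ($key1) = split / /, $line1;

    # Step past buffered f2 blocks whose key is now too small; print a
    # block only if no line from f1 ever matched it.
    while (defined $key2 && $key2 lt $key1) {
        print map { "$_\n" } @$block unless $matched;
        ($key2, $block) = next_block();
        $matched = 0;
    }

    if (defined $key2 && $key1 eq $key2) {
        $matched = 1;
        print "$line1 $_\n" for @$block;    # every combination for this key
    }
    else {
        print "$line1\n";                   # no partner in f2
    }
}

# Whatever is still buffered from f2 has no partner left in f1.
while (defined $key2) {
    print map { "$_\n" } @$block unless $matched;
    $matched = 0;
    ($key2, $block) = next_block();
}

The $matched flag is what keeps an already-joined block from also being reported as unmatched once the key moves on.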