Re: Re: Re: Re: many to many join on text files
by tilly (Archbishop) on Apr 15, 2004 at 02:02 UTC
Here is a more scalable solution that does what yours does and should handle large amounts of data (using some disk). If your data fits comfortably in memory, you may get away with passing undef in place of each filename, which makes DB_File keep the btree in RAM.
#! /usr/bin/perl -w
use strict;
use DB_File;

# $DB_BTREE and the R_* constants are exported by DB_File.
# Allow the btrees to have multiple entries per key.
$DB_BTREE->{flags} = R_DUP;

# DB_File wants you to create these with a tie, so I will even though
# I'm ignoring the tied hash.
unlink("holds.dbm"); # In case it is there
my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'holds.dbm': $!";

unlink("copies.dbm"); # In case it is there
my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'copies.dbm': $!";

open(COPIES, "<copies") or die "Can't open 'copies': $!";
while (<COPIES>) {
    chomp;
    my ($lookup) = split /\|/, $_;
    $btree_copies->put($lookup, $_);
}

open(HOLDS, "<holds") or die "Can't open 'holds': $!";
while (<HOLDS>) {
    chomp(my $value = $_);
    my ($lookup) = split /\|/, $value;
    $btree_holds->put($lookup, $value);
    my @copies = $btree_copies->get_dup($lookup);
    if (@copies) {
        print "hold and copy for $lookup\n" for @copies;
    }
    else {
        print "hold for $lookup\n";
    }
}

# Walk copies using the tree. Note that the API is somewhat obscure...
for (
    my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
    0 == $status;
    $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
    my @holds = $btree_holds->get_dup($lookup);
    if (@holds) {
        print "copy and hold for $lookup\n" for @holds;
    }
    else {
        print "copy for $lookup\n";
    }
}

END {
    $btree_holds = $btree_copies = undef;
    untie %hash_holds;
    untie %hash_copies;
    unlink($_) for 'holds.dbm', 'copies.dbm';
}
Note that for any real use you probably don't want to print both "hold and copy" and "copy and hold", since they are synonymous.
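For the in-RAM variant mentioned at the top, the only change needed is the tie: DB_File documents that an undef filename creates a temporary in-memory database. A minimal sketch (same API as above, nothing to unlink afterwards):
#! /usr/bin/perl -w
use strict;
use DB_File;

$DB_BTREE->{flags} = R_DUP;

# undef in place of a filename => DB_File builds the btree in memory
my $btree_copies = tie my %hash_copies, 'DB_File', undef,
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory btree: $!";

$btree_copies->put("abc", "abc|field1|field2");
$btree_copies->put("abc", "abc|field3|field4");   # duplicate key is kept
print "$_\n" for $btree_copies->get_dup("abc");   # prints both records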
Re: Re: Re: Re: many to many join on text files
by etcshadow (Priest) on Apr 15, 2004 at 03:35 UTC
Egad, man, that's awful! To put it in database terminology, you're doing a double-nested-loops outer join... over a full table scan! To quote the Simpsons: "That's bad".
What you should be doing for a large equi-join like this is a method called a "merge join". The concept is: sort both files first, on the columns you wish to join by, then open up each file and advance through the two in step. Think of a zipper.
Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):
# assuming this is *nix, or something with a sort utility,
# otherwise this can be done directly in perl
system("sort holds > tmpholds") and die "sort holds failed";
system("sort copies > tmpcopies") and die "sort copies failed";

open(HOLDS, "<tmpholds") or die "tmpholds: $!";
my (@holds, $holdseof);
sub readhold {
    defined($_ = <HOLDS>) or return $holdseof++;
    chomp;
    @holds = split /\|/, $_, -1;
}
readhold;

open(COPIES, "<tmpcopies") or die "tmpcopies: $!";
my (@copies, $copieseof);
sub readcopy {
    defined($_ = <COPIES>) or return $copieseof++;
    chomp;
    @copies = split /\|/, $_, -1;
}
readcopy;

while (!($holdseof && $copieseof)) {
    if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
        print "copy ($copies[0])\n";
        readcopy;
    }
    elsif ($copieseof || $copies[0] gt $holds[0]) {
        print "hold ($holds[0])\n";
        readhold;
    }
    else {
        print "hold and copy ($holds[0])\n";
        readhold;
        readcopy;
    }
}
close HOLDS;
close COPIES;
__END__
holds
------
iiiii
asdf
fdd
dsafe
dsaf
bfer
rewtewt
bfret
zzzzzzzzz
copies
------
weewr
dddddd
rewtewt
bfret
fdfdsfsdfdsa
dsafe
dsaf
asdf
fdd
bfer
output
------
hold and copy (asdf)
hold and copy (bfer)
hold and copy (bfret)
copy (dddddd)
hold and copy (dsaf)
hold and copy (dsafe)
hold and copy (fdd)
copy (fdfdsfsdfdsa)
hold (iiiii)
hold and copy (rewtewt)
copy (weewr)
hold (zzzzzzzzz)
------------
:Wq
Not an editor command: Wq
FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).
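For comparison, a rough sketch of such a hash join on the same holds/copies files (assuming copies is the smaller file and fits in memory; duplicate keys are kept so the many-to-many case still works; to stay short it doesn't report copy-only rows, so it's a left join rather than a full outer join):
#!/usr/bin/perl -w
use strict;

# build: hash the smaller file on its join key, keeping duplicates
open(COPIES, "<copies") or die "copies: $!";
my %copies;
while (<COPIES>) {
    chomp;
    my ($key) = split /\|/;
    push @{ $copies{$key} }, $_;
}
close COPIES;

# probe: stream the larger file past the hash
open(HOLDS, "<holds") or die "holds: $!";
while (<HOLDS>) {
    chomp;
    my ($key) = split /\|/;
    if ($copies{$key}) {
        print "hold and copy ($key)\n" for @{ $copies{$key} };
    }
    else {
        print "hold ($key)\n";
    }
}
close HOLDS;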
|
As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset is a many-to-many join, and where each side is many he wants every combination to be represented. That can be done by walking the two files in parallel, but it requires some backtracking logic that can be tricky to get right.
The BTREE solution that I gave is very similar to pre-sorting both files and walking them in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.
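To illustrate the tricky part mentioned above, a minimal sketch of what the parallel walk has to emit for one key that is duplicated on both sides (hypothetical rows):
# one duplicated key on both sides: every hold pairs with every copy
my @hold_group = ('k|h1', 'k|h2');        # hypothetical rows for key 'k'
my @copy_group = ('k|c1', 'k|c2', 'k|c3');
for my $hold (@hold_group) {
    for my $copy (@copy_group) {
        print "$hold matches $copy\n";    # 2 x 3 = 6 output rows
    }
}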
Re: Re: Re: Re: many to many join on text files
by hv (Prior) on Apr 15, 2004 at 01:39 UTC
Hmm, does this represent everything the program needs to do? Because if so, I note that you only ever use the first field of each line - you can load the first field of all the lines in both files into memory, and avoid the painfully slow re-reading; something like (untested):
my $hold = readfile('holds');
my $copy = readfile('copies');
for my $key (keys %$hold) {
    if ($copy->{$key}) {
        print "$key: hold and copy (or copy and hold)\n";
    }
}

sub readfile {
    my $file = shift;
    my $hash = {};
    open(my $fh, "<$file") or die "$file: $!";
    local $_;
    while (<$fh>) {
        # fields are '|' delimited - pick up the first field
        my $key = substr $_, 0, index($_, "|");
        ++$hash->{$key};
    }
    close $fh;
    return $hash;
}
Even if this is only the starting point, and the real code needs to access all the fields, you could for example cache in memory the first field and the offset into the file for each row, and then use seek() to locate the complete record whenever you need it.
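Something along these lines (also untested; index_file and fetch_record are hypothetical helpers for the offset-caching idea just described):
# cache file offsets per key (an arrayref per key allows duplicates),
# then pull full records back on demand with seek()
sub index_file {
    my $file = shift;
    my $hash = {};
    open(my $fh, "<$file") or die "$file: $!";
    local $_;
    while (1) {
        my $pos = tell $fh;      # offset of the row about to be read
        last unless defined($_ = <$fh>);
        my $key = substr $_, 0, index($_, "|");
        push @{ $hash->{$key} }, $pos;
    }
    return ($hash, $fh);
}

sub fetch_record {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "seek: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}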
Hugo
|
|
|
Well, it doesn't have to lose any data you need to keep - I originally wrote it as $hash->{$key} = 1, and changed it to an increment to preserve the count; it could just as easily store an arrayref of line numbers or file positions for each key. The core concept, though, is that the primary super-slow operation is searching through a file repeatedly for matches on a key; if there is any way to turn that into a hash lookup, it's likely to improve things as long as there is memory left.
I think we'd need to see some information about the actual task at hand to know whether there's some reason an in-core hash cannot be used, but unless memory is unusually restricted for a modern-day computer it seems unlikely.
Hugo
|
The nitty gritty of the processing after the full outer join is trivial compared to the join itself. Both tables may also contain duplicate rows... which need to match up in full outer join fashion - this breaks the logic of my program to do a proper full outer join.
I've just tried M$Access... can't find a full outer join there... it does have a nice text import wizard though, so columns automatically get field names.
Will try mysql quickly next.
Re: Re: Re: Re: many to many join on text files
by dave_the_m (Monsignor) on Apr 15, 2004 at 13:37 UTC
Something like the code below reduces it to one pass. It assumes that the two files are both pre-sorted on the key field.
The idea is to maintain a buffer containing a window of all the adjacent lines in the second file that share the current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes). Then read the first file one line at a time and get its key. If the key is less than the current key for the buffer, print the line; if it's greater, print the accumulated lines from the second file (unless they already matched) and refill the buffer; if they're the same, print the current line from file 1 with each of the lines in the buffer. The fiddly parts are making sure the buffer is flushed at the right times and that EOF on either file is handled correctly. But I'm supposed to be working rather than messing about on perlMonks...
#!/usr/bin/perl -w
use strict;

open my $f1, "<", "a" or die "a: $!";
open my $f2, "<", "b" or die "b: $!";

my ($key2, @rest2, $nkey2, $nrest2, $eof2, $matched2);

# read in the next N lines from f2 that have the same key; the first
# line of the following block stays pending in ($nkey2, $nrest2)
sub get_next_block {
    @rest2 = ();
    $matched2 = 0;
    while (1) {
        if (defined $nkey2) {
            push @rest2, $nrest2;
            $key2 = $nkey2;
        }
        my $line2 = <$f2>;
        unless (defined $line2) {
            $nkey2 = undef;
            $eof2 = 1 unless @rest2;  # nothing buffered: f2 is exhausted
            return;
        }
        chomp $line2;
        ($nkey2, $nrest2) = split / /, $line2, 2;
        last if defined $key2 && @rest2 && $nkey2 ne $key2;
    }
}

get_next_block();
while (defined(my $line1 = <$f1>)) {
    chomp $line1;
    my ($key1, $rest1) = split / /, $line1, 2;
    # flush (and print, if never matched) any f2 blocks whose key
    # sorts before this line's key
    while (!$eof2 && $key2 lt $key1) {
        unless ($matched2) {
            print "$key2 $_\n" for @rest2;
        }
        get_next_block();
    }
    if ($eof2 || $key1 lt $key2) {
        print "$key1 $rest1\n";               # unmatched line from file 1
    }
    else {
        # equal keys: this line pairs with every line in the buffer
        print "$key1 $rest1 $_\n" for @rest2;
        $matched2 = 1;
    }
}
# file 1 is done: flush any remaining unmatched blocks from f2
until ($eof2) {
    unless ($matched2) {
        print "$key2 $_\n" for @rest2;
    }
    get_next_block();
}