
Here is a more scalable solution that does what yours does and should handle large amounts of data (at the cost of some disk space). If your data fits comfortably in memory, you may get away with replacing the filenames with undef in the tie calls, which makes DB_File hold the database in RAM.
#! /usr/bin/perl -w
use strict;
use DB_File;
use vars qw($DB_BTREE); # This was exported by DB_File

# Allow the btree's to have multiple entries per key
$DB_BTREE->{flags} = R_DUP;

# DB_File wants you to create these with a tie, so I will even though
# I'm ignoring the tied hash.
unlink("holds.dbm"); # In case it is there
my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
  O_RDWR|O_CREAT, 0666, $DB_BTREE
  or die "Cannot create btree 'holds.dbm': $!";

unlink("copies.dbm"); # In case it is there
my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
  O_RDWR|O_CREAT, 0666, $DB_BTREE
  or die "Cannot create btree 'copies.dbm': $!";

open(COPIES, "<copies") or die "Can't open 'copies': $!";
while (<COPIES>) {
  chomp;
  (my $lookup) = split /\|/, $_;
  $btree_copies->put($lookup, $_);
}

open(HOLDS, "<holds") or die "Can't open 'holds': $!";
while (<HOLDS>) {
  chomp(my $value = $_);
  (my $lookup) = split /\|/, $value;
  $btree_holds->put($lookup, $value);
  if ($btree_copies->get_dup($lookup)) {
    foreach my $other_value ($btree_copies->get_dup($lookup)) {
      print "hold and copy for $lookup\n";
    }
  }
  else {
    print "hold for $lookup\n";
  }
}

# Walk copies using the tree. Note that the API is somewhat obscure...
for (
  my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
  0 == $status;
  $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
  if ($btree_holds->get_dup($lookup)) {
    foreach my $other_value ($btree_holds->get_dup($lookup)) {
      print "copy and hold for $lookup\n";
    }
  }
  else {
    print "copy for $lookup\n";
  }
}

sub END {
  $btree_holds = $btree_copies = undef;
  untie %hash_holds;
  untie %hash_copies;
  unlink($_) for 'holds.dbm', 'copies.dbm';
}
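For the in-memory variant mentioned at the top, here is a minimal sketch of what the tie might look like with undef in place of the filename. The keys and records ('b1', 'b1|copy A', and so on) are made-up sample data, not taken from your files, and there is nothing to unlink afterwards because no file is created.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

# Allow multiple entries per key, as in the on-disk version.
$DB_BTREE->{flags} = R_DUP;

# undef instead of a filename asks DB_File for an in-memory database.
my %hash_copies;
my $btree_copies = tie %hash_copies, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
  or die "Cannot create in-memory btree: $!";

# Hypothetical sample records, keyed on the first |-separated field.
for my $line ('b1|copy A', 'b1|copy B', 'b2|copy C') {
  my ($lookup) = split /\|/, $line;
  $btree_copies->put($lookup, $line);
}

# In list context get_dup returns every value stored under the key.
my @b1_copies = $btree_copies->get_dup('b1');
my $b1_count  = scalar @b1_copies;
print "b1 has $b1_count copies\n";

# Drop the object reference before untying to avoid an untie warning.
undef $btree_copies;
untie %hash_copies;
```

Everything else in the script stays the same; only the filename argument and the unlink calls change.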
Note that for any real use, you probably don't want to report both "hold and copy" and "copy and hold", since they describe the same match.
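One way to report each match only once is to keep the "hold and copy" lines from the holds pass and have the walk over copies print only the copies that have no hold. The following is a sketch under that assumption, using in-memory btrees and invented sample data (@copy_lines, @hold_lines and the @report array are illustrative names, not part of the script above).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DB_File;

$DB_BTREE->{flags} = R_DUP;

# In-memory btrees (undef filename) keep the sketch self-contained.
my $holds = tie my %hash_holds, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
  or die "Cannot create holds btree: $!";
my $copies = tie my %hash_copies, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
  or die "Cannot create copies btree: $!";

# Hypothetical sample data standing in for the 'copies' and 'holds' files.
my @copy_lines = ('b1|copy A', 'b1|copy B', 'b3|copy C');
my @hold_lines = ('b1|hold X', 'b2|hold Y');

for my $line (@copy_lines) {
  my ($lookup) = split /\|/, $line;
  $copies->put($lookup, $line);
}

my @report;

# First pass: report every hold, paired with each matching copy if any.
for my $line (@hold_lines) {
  my ($lookup) = split /\|/, $line;
  $holds->put($lookup, $line);
  my @matches = $copies->get_dup($lookup);
  if (@matches) {
    push @report, "hold and copy for $lookup" for @matches;
  }
  else {
    push @report, "hold for $lookup";
  }
}

# Second pass: walk the copies and report only those with no hold,
# since the matched pairs were already reported above.
my ($lookup, $value) = ('', '');
for (
  my $status = $copies->seq($lookup, $value, R_FIRST);
  0 == $status;
  $status = $copies->seq($lookup, $value, R_NEXT)
) {
  push @report, "copy for $lookup" unless $holds->get_dup($lookup);
}

print "$_\n" for @report;

undef $holds;
undef $copies;
untie %hash_holds;
untie %hash_copies;
```

With the sample data this reports the two b1 pairs once each, the unmatched hold for b2, and the unmatched copy for b3.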