
Here is a more scalable solution that does what yours does and should handle large amounts of data by keeping its indexes on disk. If your data fits comfortably in memory, you can pass undef in place of the filenames, which makes DB_File hold each dbm in RAM.

#! /usr/bin/perl -w
use strict;
use DB_File;
use vars qw($DB_BTREE); # This was exported by DB_File

# Allow the btrees to have multiple entries per key
$DB_BTREE->{flags} = R_DUP;


# DB_File wants you to create these with a tie, so I will even though
# I'm ignoring the tied hash.
unlink("holds.dbm"); # In case it is there
my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
  O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'holds.dbm': $!";

unlink("copies.dbm"); # In case it is there
my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
  O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'copies.dbm': $!";

open(COPIES, "<copies") or die "Can't open 'copies': $!";
while (<COPIES>) {
  chomp;
  (my $lookup) = split /\|/, $_;
  $btree_copies->put($lookup, $_);
}

open(HOLDS, "<holds") or die "Can't open 'holds': $!";
while (<HOLDS>) {
  chomp(my $value = $_);
  (my $lookup) = split /\|/, $value;
  $btree_holds->put($lookup, $value);

  if ($btree_copies->get_dup($lookup)) {
    foreach my $other_value ($btree_copies->get_dup($lookup)) {
      print "hold and copy for $lookup\n";
    }
  }
  else {
    print "hold for $lookup\n";
  }
}

# Walk copies using the tree.  Note that the API is somewhat obscure...
for (
  my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
  0 == $status;
  $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
  if ($btree_holds->get_dup($lookup)) {
    foreach my $other_value ($btree_holds->get_dup($lookup)) {
      print "copy and hold for $lookup\n";
    }
  }
  else {
    print "copy for $lookup\n";
  }
}

sub END {
  $btree_holds = $btree_copies = undef;
  untie %hash_holds;
  untie %hash_copies;
  unlink($_) for 'holds.dbm', 'copies.dbm';
}
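
For the in-memory variant mentioned at the top, a minimal sketch might look like the following. It assumes that passing undef as the filename makes DB_File build the btree in RAM, as its documentation describes. The sample records are hypothetical and only mimic the "key|rest" layout used by the script.

```perl
#! /usr/bin/perl -w
use strict;
use DB_File;
use Fcntl;

# Allow multiple entries per key, as in the script above.
$DB_BTREE->{flags} = R_DUP;

# undef instead of a filename requests an in-memory database,
# so there is no file to unlink before or after the run.
my %hash_copies;
my $btree_copies = tie %hash_copies, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory btree: $!";

# Hypothetical sample data standing in for the 'copies' file.
for my $line ("1|copy a", "1|copy b", "2|copy c") {
  (my $lookup) = split /\|/, $line;
  $btree_copies->put($lookup, $line);
}

# In list context get_dup returns every value stored under the key.
my @dups = $btree_copies->get_dup("1");
print scalar(@dups), " copies for key 1\n";

# Drop the object reference before untie to avoid a warning under -w.
undef $btree_copies;
untie %hash_copies;
```

The trade-off is the obvious one: nothing is written to disk, so the whole index has to fit in memory.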


Note that for any real use, you probably don't want to print both "hold and copy" and "copy and hold", since they describe the same match.
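
One way to drop the duplicate report is to let the holds pass print the matches and have the copies pass print only copies that have no hold. This is a sketch under that assumption, not the only reasonable choice. It uses in-memory btrees and hypothetical sample records so that it runs on its own.

```perl
#! /usr/bin/perl -w
use strict;
use DB_File;
use Fcntl;

$DB_BTREE->{flags} = R_DUP;

# In-memory btrees (undef filename) keep the sketch self-contained.
my $btree_holds = tie my %hash_holds, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create holds btree: $!";
my $btree_copies = tie my %hash_copies, 'DB_File', undef,
  O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create copies btree: $!";

# Hypothetical sample data in "key|rest" form.
my @holds  = ("1|hold a", "3|hold b");
my @copies = ("1|copy a", "1|copy b", "2|copy c");

for my $line (@copies) {
  (my $lookup) = split /\|/, $line;
  $btree_copies->put($lookup, $line);
}

my @report;

# Pass 1: every hold, paired with each matching copy if there is one.
for my $line (@holds) {
  (my $lookup) = split /\|/, $line;
  $btree_holds->put($lookup, $line);
  my @matches = $btree_copies->get_dup($lookup);
  if (@matches) {
    push @report, "hold and copy for $lookup" for @matches;
  }
  else {
    push @report, "hold for $lookup";
  }
}

# Pass 2: only copies with no hold; matches were reported in pass 1.
my ($lookup, $value) = ("", "");
for (
  my $status = $btree_copies->seq($lookup, $value, R_FIRST);
  0 == $status;
  $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
  next if $btree_holds->get_dup($lookup);  # scalar context: a count
  push @report, "copy for $lookup";
}

print "$_\n" for @report;

$btree_holds = $btree_copies = undef;
untie %hash_holds;
untie %hash_copies;
```

Each hold/copy pair is then reported exactly once, and the second walk only adds the unmatched copies.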

In reply to Re: Re: Re: Re: many to many join on text files by tilly
in thread many to many join on text files by aquarium
