in reply to Re: many to many join on text files
in thread many to many join on text files

..full outer join not implemented in DBD::SQLite... damn. If I go for an RDBMS it'll take me ages just to load up the tables, as they have varying numbers of columns and a lot of them (max 47 columns in one table). The data is UTF-8, with Chinese characters in some columns. I'm just matching on the first column, which is a plain old long number. Trying to do the full join now in Perl. It's already very slow... and this is just on the 1000-line subsets of the files!

Re: Re: Re: many to many join on text files
by aquarium (Curate) on Apr 15, 2004 at 00:37 UTC
    Finished writing the full join in Perl (after which I have to do a sort and uniq to strip duplicates)... it's been running for 10 minutes now and still hasn't finished processing the first file. Here's the code:
    open(HOLDS, "<holds") or die;
    while ($hold = <HOLDS>) {
        chomp $hold;
        @holds = split(/\|/, $hold, -1);
        $lookup = $holds[0];
        open(COPIES, "<copies") or die;
        undef $matched;
        while ($copy = <COPIES>) {
            chomp $copy;
            @copies = split(/\|/, $copy, -1);
            $matchfield = $copies[0];
            if ($lookup eq $matchfield) {
                $matched = 1;
                print "hold and copy\n";
            }
        }
        if (!$matched) { print "hold\n"; }
        close COPIES;
    }
    close HOLDS;

    open(COPIES, "<copies") or die;
    while ($copy = <COPIES>) {
        chomp $copy;
        @copies = split(/\|/, $copy, -1);
        $lookup = $copies[0];
        open(HOLDS, "<holds") or die;
        undef $matched;
        while ($hold = <HOLDS>) {
            chomp $hold;
            @holds = split(/\|/, $hold, -1);
            $matchfield = $holds[0];
            if ($lookup eq $matchfield) {
                $matched = 1;
                print "copy and hold\n";
            }
        }
        if (!$matched) { print "copy\n"; }
        close HOLDS;
    }
    close COPIES;
      Here is a more scalable solution which does what yours does, and should handle large amounts of data (using some disk). If your data fits in memory nicely, you may get away with replacing the filenames with undef, which makes the DBM be held in RAM.
      #! /usr/bin/perl -w
      use strict;
      use DB_File;
      use vars qw($DB_BTREE);  # This was exported by DB_File

      # Allow the btrees to have multiple entries per key
      $DB_BTREE->{flags} = R_DUP;

      # DB_File wants you to create these with a tie, so I will even though
      # I'm ignoring the tied hash.
      unlink("holds.dbm");  # In case it is there
      my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
          O_RDWR|O_CREAT, 0666, $DB_BTREE
          or die "Cannot create btree 'holds.dbm': $!";

      unlink("copies.dbm");  # In case it is there
      my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
          O_RDWR|O_CREAT, 0666, $DB_BTREE
          or die "Cannot create btree 'copies.dbm': $!";

      open(COPIES, "<copies") or die "Can't open 'copies': $!";
      while (<COPIES>) {
          chomp;
          (my $lookup) = split /\|/, $_;
          $btree_copies->put($lookup, $_);
      }

      open(HOLDS, "<holds") or die "Can't open 'holds': $!";
      while (<HOLDS>) {
          chomp(my $value = $_);
          (my $lookup) = split /\|/, $value;
          $btree_holds->put($lookup, $value);
          if ($btree_copies->get_dup($lookup)) {
              foreach my $other_value ($btree_copies->get_dup($lookup)) {
                  print "hold and copy for $lookup\n";
              }
          }
          else {
              print "hold for $lookup\n";
          }
      }

      # Walk copies using the tree.  Note that the API is somewhat obscure...
      for (
          my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
          0 == $status;
          $status = $btree_copies->seq($lookup, $value, R_NEXT)
      ) {
          if ($btree_holds->get_dup($lookup)) {
              foreach my $other_value ($btree_holds->get_dup($lookup)) {
                  print "copy and hold for $lookup\n";
              }
          }
          else {
              print "copy for $lookup\n";
          }
      }

      sub END {
          $btree_holds = $btree_copies = undef;
          untie %hash_holds;
          untie %hash_copies;
          unlink($_) for 'holds.dbm', 'copies.dbm';
      }
      Note that for any real use, you probably don't want to both do "hold and copy" and "copy and hold" since they are synonymous.
      Egad, man that's awful! To put it in database terminology, you're doing a double-nested-loops-outer-join... over a full-table-scan! To quote the Simpsons: "That's bad".

      What you should be doing for a large equi-join like this is a method called a "merge join". The concept: sort both files first on the columns you wish to join by, then open each file and advance through the two files in step. Think of a zipper.

      Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):

      # assuming this is *nix, or something with a sort utility, otherwise
      # this can be done directly in perl
      system("sort holds > tmpholds") and die;
      system("sort copies > tmpcopies") and die;

      open(HOLDS, "<tmpholds") or die;
      my (@holds, $holdseof);
      sub readhold {
          ($_ = <HOLDS>) || $holdseof++;
          chomp;
          @holds = split(/\|/, $_, -1);
      }
      readhold;

      open(COPIES, "<tmpcopies") or die;
      my (@copies, $copieseof);
      sub readcopy {
          ($_ = <COPIES>) || $copieseof++;
          chomp;
          @copies = split(/\|/, $_, -1);
      }
      readcopy;

      while (!($holdseof && $copieseof)) {
          if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
              print "copy ($copies[0])\n";
              readcopy;
          }
          elsif ($copieseof || $copies[0] gt $holds[0]) {
              print "hold ($holds[0])\n";
              readhold;
          }
          else {
              print "hold and copy ($holds[0])\n";
              readhold;
              readcopy;
          }
      }
      close HOLDS;
      close COPIES;

      __END__
      holds
      ------
      iiiii
      asdf
      fdd
      dsafe
      dsaf
      bfer
      rewtewt
      bfret
      zzzzzzzzz

      copies
      ------
      weewr
      dddddd
      rewtewt
      bfret
      fdfdsfsdfdsa
      dsafe
      dsaf
      asdf
      fdd
      bfer

      output
      ------
      hold and copy (asdf)
      hold and copy (bfer)
      hold and copy (bfret)
      copy (dddddd)
      hold and copy (dsaf)
      hold and copy (dsafe)
      hold and copy (fdd)
      copy (fdfdsfsdfdsa)
      hold (iiiii)
      hold and copy (rewtewt)
      copy (weewr)
      hold (zzzzzzzzz)
      ------------
      :Wq
      Not an editor command: Wq
      FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).
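      A hash join that keeps the many-to-many semantics can be sketched in Perl like this: build a hash of array references from the smaller file, then stream the larger file against it, emitting every matching combination. (A rough sketch only, not part of any post above; the subroutine name, the "||" output separator, and the "(no match)" marker are all made up for illustration.)

```perl
use strict;
use warnings;

# Hash join that preserves many-to-many matches.  $small and $big are
# filenames of pipe-delimited files keyed on the first field; returns
# the joined lines.
sub hash_join {
    my ($small, $big) = @_;

    # key => [all lines from the smaller file with that key]
    # Arrayrefs (rather than a plain counter hash) keep duplicate keys,
    # so many-to-many matches are not collapsed down to one.
    my %side;
    open(my $sfh, '<', $small) or die "$small: $!";
    while (my $line = <$sfh>) {
        chomp $line;
        my ($key) = split /\|/, $line, -1;
        push @{ $side{$key} }, $line;
    }
    close $sfh;

    # Stream the larger file and emit every matching combination.
    my @out;
    open(my $bfh, '<', $big) or die "$big: $!";
    while (my $line = <$bfh>) {
        chomp $line;
        my ($key) = split /\|/, $line, -1;
        if ($side{$key}) {
            push @out, "$line || $_" for @{ $side{$key} };
        }
        else {
            push @out, "$line || (no match)";
        }
    }
    close $bfh;
    return @out;
}
```

      The memory cost is one side of the join held in RAM, which is why a database would only pick this plan for a small table.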
        As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset has many-to-many joins, and where each side is many, he wants every combination to be represented. This can be done by walking in parallel, but it requires some backtracking logic that can be tricky to get right.

        The BTREE solution that I gave is very similar to pre-sorting both and walking in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.

      Hmm, does this represent everything the program needs to do? Because if so, I note that you only ever use the first field of each line - you can load the first field of all the lines in both files into memory, and avoid the painfully slow re-reading; something like (untested):

      my $hold = readfile('holds');
      my $copy = readfile('copies');
      for my $key (keys %$hold) {
          if ($copy->{$key}) {
              print "$key: hold and copy (or copy and hold)\n";
          }
      }

      sub readfile {
          my $file = shift;
          my $hash = {};
          open(my $fh, "<$file") or die "$file: $!";
          local $_;
          while (<$fh>) {
              # fields are '|' delimited - pick up the first field
              my $key = substr $_, 0, index($_, "|");
              ++$hash->{$key};
          }
          close $fh;
          return $hash;
      }

      Even if this is only the starting point, and the real code needs to access all the fields, you could for example cache in memory the first field and the offset into the file for each row, and then use seek() to locate the complete record whenever you need it.
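      The offset-caching idea might look roughly like this: one pass with tell() to build an in-memory index of key => [byte offsets], then seek() to pull any complete record back on demand. (An untested sketch, not from the thread; the subroutine names are invented.)

```perl
use strict;
use warnings;

# One pass: record the byte offset of every line, keyed by its first field.
# An arrayref per key keeps many-to-many keys intact.
sub build_index {
    my ($file) = @_;
    my %index;
    open(my $fh, '<', $file) or die "$file: $!";
    while (1) {
        my $offset = tell $fh;         # position before reading the line
        my $line   = <$fh>;
        last unless defined $line;
        my ($key) = split /\|/, $line, -1;
        push @{ $index{$key} }, $offset;
    }
    close $fh;
    return \%index;
}

# Fetch the complete records for one key without rescanning the file.
sub fetch_records {
    my ($file, $index, $key) = @_;
    my @records;
    open(my $fh, '<', $file) or die "$file: $!";
    for my $offset (@{ $index->{$key} || [] }) {
        seek $fh, $offset, 0 or die "seek: $!";
        my $line = <$fh>;
        chomp $line;
        push @records, $line;
    }
    close $fh;
    return @records;
}
```

      Only the keys and offsets live in memory; the 47-column rows stay on disk until they are actually needed.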

      Hugo

        He specified many to many. Using a hash like that will cut the many down to just 1, and lose data.

        (I would assume that in the full application he will do something more interesting than just print a message.)

      Something like the code below reduces it to one pass. It assumes that the two files are both pre-sorted on the key field.
      The idea is to maintain a buffer containing a window of all the adjacent lines in the second file that share the current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes). Then read the first file one line at a time and get its key. If the key is less than the current key for the buffer, print the line; if it's greater, print the accumulated lines from the second file and refill the buffer. If they're the same, print out the current line from file 1 with each of the lines in the buffer. The code below doesn't actually work yet; it needs more work to ensure that the buffer is flushed at the right times, etc., and doesn't handle EOFs correctly. But I'm supposed to be working rather than messing about on PerlMonks...

      #!/usr/bin/perl -w
      use strict;

      open my $f1, '<', 'a' or die "a: $!";
      open my $f2, '<', 'b' or die "b: $!";

      my ($key2, @rest2, $nkey2, $nrest2);

      # read in next N lines from f2 that have the same key
      sub get_next_block {
          @rest2 = ();
          while (1) {
              if (defined $nkey2) {
                  push @rest2, $nrest2;
                  $key2 = $nkey2;
              }
              my $line2 = <$f2>;
              return 0 unless defined $line2;
              ($nkey2, $nrest2) = split / /, $line2;
              chomp $nrest2;
              last if defined $key2 && $nkey2 ne $key2;
          }
      }

      get_next_block();

      OUTER: while (defined(my $line1 = <$f1>)) {
          my ($key1, $rest1) = split / /, $line1;
          chomp $rest1;
          if ($key1 gt $key2) {
              print "$key2 $_\n" for @rest2;
              get_next_block();
              next;
          }
          if ($key1 lt $key2) {
              print $line1;
              next;
          }
          print "$key1 $rest1 $_\n" for @rest2;
      }
      print while (<$f1>);
      print while (<$f2>);