in reply to many to many join on text files

As merlyn says, an RDBMS is probably the best solution; however, if the task is simple and the data is sorted, you can possibly do it in Perl much faster than with an RDBMS. What precisely do you want to do? More specifically: is the data sorted (or could it be)? Are the keys unique or non-unique? i.e. given this input, is there a neat way in Perl to generate this output?

cheers

tachyon

Re: Re: many to many join on text files
by aquarium (Curate) on Apr 14, 2004 at 23:36 UTC
    Yeah... I can sort it with the "sort" shell command. I'll look into SQLite in a moment... hope it supports a full outer join, as that's what I need.
        ...let's say it does support a full outer join... is there a statement (that some RDBMSs have) to load delimited text files into tables without first creating the tables and specifying each column?
Re: Re: many to many join on text files
by aquarium (Curate) on Apr 15, 2004 at 00:29 UTC
    ...full outer join is not implemented in DBD::SQLite... damn, if I go for an RDBMS it'll take me ages just to load up the tables, as they have varying numbers of columns, and a lot of them (max 47 columns in a table). The data is UTF-8, with Chinese characters in some columns. I'm just matching on the first column, which is a plain old long number. Trying to do the full join now in Perl. It's already very slow... and this is just on the 1000-line subsets of the files!
      Finished writing the full join in Perl (after which I have to do a sort and uniq to strip duplicates)... it's been running for 10 minutes now and still hasn't finished processing the first file... here's the code:
      # First pass: for every hold, scan all of copies (nested loop).
      open(HOLDS,"<holds") or die;
      while($hold=<HOLDS>) {
          chomp $hold;
          @holds = split(/\|/,$hold,-1);
          $lookup = $holds[0];
          open(COPIES,"<copies") or die;
          undef $matched;
          while($copy=<COPIES>) {
              chomp $copy;
              @copies = split(/\|/,$copy,-1);
              $matchfield = $copies[0];
              if($lookup eq $matchfield) {
                  $matched = 1;
                  print "hold and copy\n";
              }
          }
          if(!$matched) { print "hold\n"; }
          close COPIES;
      }
      close HOLDS;

      # Second pass: the same thing in the other direction.
      open(COPIES,"<copies") or die;
      while($copy=<COPIES>) {
          chomp $copy;
          @copies = split(/\|/,$copy,-1);
          $lookup = $copies[0];
          open(HOLDS,"<holds") or die;
          undef $matched;
          while($hold=<HOLDS>) {
              chomp $hold;
              @holds = split(/\|/,$hold,-1);
              $matchfield = $holds[0];
              if($lookup eq $matchfield) {
                  $matched = 1;
                  print "copy and hold\n";
              }
          }
          if(!$matched) { print "copy\n"; }
          close HOLDS;
      }
      close COPIES;
        Here is a more scalable solution which does what yours does and should handle large amounts of data (using some disk). If your data fits in memory nicely, then you may get away with replacing the filenames with undef, making the DBM be held in RAM.
        #! /usr/bin/perl -w
        use strict;
        use Fcntl;    # for the O_RDWR and O_CREAT flags
        use DB_File;
        use vars qw($DB_BTREE); # This was exported by DB_File

        # Allow the btrees to have multiple entries per key
        $DB_BTREE->{flags} = R_DUP;

        # DB_File wants you to create these with a tie, so I will even though
        # I'm ignoring the tied hash.
        unlink("holds.dbm"); # In case it is there
        my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
            O_RDWR|O_CREAT, 0666, $DB_BTREE
            or die "Cannot create btree 'holds.dbm': $!";

        unlink("copies.dbm"); # In case it is there
        my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
            O_RDWR|O_CREAT, 0666, $DB_BTREE
            or die "Cannot create btree 'copies.dbm': $!";

        open(COPIES, "<copies") or die "Can't open 'copies': $!";
        while (<COPIES>) {
            chomp;
            (my $lookup) = split /\|/, $_;
            $btree_copies->put($lookup, $_);
        }

        open(HOLDS, "<holds") or die "Can't open 'holds': $!";
        while (<HOLDS>) {
            chomp(my $value = $_);
            (my $lookup) = split /\|/, $value;
            $btree_holds->put($lookup, $value);
            if ($btree_copies->get_dup($lookup)) {
                foreach my $other_value ($btree_copies->get_dup($lookup)) {
                    print "hold and copy for $lookup\n";
                }
            }
            else {
                print "hold for $lookup\n";
            }
        }

        # Walk copies using the tree. Note that the API is somewhat obscure...
        for (
            my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
            0 == $status;
            $status = $btree_copies->seq($lookup, $value, R_NEXT)
        ) {
            if ($btree_holds->get_dup($lookup)) {
                foreach my $other_value ($btree_holds->get_dup($lookup)) {
                    print "copy and hold for $lookup\n";
                }
            }
            else {
                print "copy for $lookup\n";
            }
        }

        sub END {
            $btree_holds = $btree_copies = undef;
            untie %hash_holds;
            untie %hash_copies;
            unlink($_) for 'holds.dbm', 'copies.dbm';
        }
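        The in-memory variant mentioned above is a one-line change per btree; a sketch, assuming DB_File's documented behavior of creating an in-memory database when the filename is undef:

        # Passing undef as the filename keeps the btree in RAM instead of on disk.
        my $btree_holds = tie my %hash_holds, 'DB_File', undef,
            O_RDWR|O_CREAT, 0666, $DB_BTREE
            or die "Cannot create in-memory btree: $!";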
        Note that for any real use, you probably don't want to print both "hold and copy" and "copy and hold", since they are synonymous.
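        A minimal sketch of that dedup, reusing the btrees from the code above: in the final walk over copies, skip any key that already matched a hold, so each pair is reported exactly once.

        # Replacement for the final copies walk: matched keys were already
        # reported as "hold and copy" during the holds pass, so skip them.
        for (
            my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
            0 == $status;
            $status = $btree_copies->seq($lookup, $value, R_NEXT)
        ) {
            next if $btree_holds->get_dup($lookup);
            print "copy for $lookup\n";
        }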
        Egad, man, that's awful! To put it in database terminology, you're doing a double-nested-loops-outer-join... over a full-table-scan! To quote the Simpsons: "That's bad".

        What you should be doing for a large equi-join like this is a method called a "merge join". The concept: sort both files first on the columns you wish to join by, then open each file and advance through the two files in step. Think of a zipper.

        Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):

        # assuming this is *nix, or something with a sort utility,
        # otherwise this can be done directly in perl
        system("sort holds > tmpholds") and die;
        system("sort copies > tmpcopies") and die;

        open(HOLDS,"<tmpholds") or die;
        my (@holds, $holdseof);
        sub readhold {
            ($_=<HOLDS>) || $holdseof++;
            chomp;
            @holds = split(/\|/,$_,-1);
        }
        readhold;

        open(COPIES,"<tmpcopies") or die;
        my (@copies, $copieseof);
        sub readcopy {
            ($_=<COPIES>) || $copieseof++;
            chomp;
            @copies = split(/\|/,$_,-1);
        }
        readcopy;

        while(!($holdseof && $copieseof)) {
            if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
                print "copy ($copies[0])\n";
                readcopy;
            }
            elsif ($copieseof || $copies[0] gt $holds[0]) {
                print "hold ($holds[0])\n";
                readhold;
            }
            else {
                print "hold and copy ($holds[0])\n";
                readhold;
                readcopy;
            }
        }
        close HOLDS;
        close COPIES;

        __END__
        holds
        ------
        iiiii
        asdf
        fdd
        dsafe
        dsaf
        bfer
        rewtewt
        bfret
        zzzzzzzzz

        copies
        ------
        weewr
        dddddd
        rewtewt
        bfret
        fdfdsfsdfdsa
        dsafe
        dsaf
        asdf
        fdd
        bfer

        output
        ------
        hold and copy (asdf)
        hold and copy (bfer)
        hold and copy (bfret)
        copy (dddddd)
        hold and copy (dsaf)
        hold and copy (dsafe)
        hold and copy (fdd)
        copy (fdfdsfsdfdsa)
        hold (iiiii)
        hold and copy (rewtewt)
        copy (weewr)
        hold (zzzzzzzzz)
        ------------ :Wq Not an editor command: Wq
        FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).
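        For illustration, here is a minimal sketch of a hash join in Perl, assuming the same pipe-delimited "holds" and "copies" files keyed on the first field: build a hash from one file (ideally the smaller), then stream the other file past it.

        use strict;
        use warnings;

        # Build side: hash every join key from the copies file.
        my %copies;
        open my $cfh, '<', 'copies' or die "copies: $!";
        while (<$cfh>) {
            chomp;
            my ($key) = split /\|/, $_, 2;
            $copies{$key}++;
        }
        close $cfh;

        # Probe side: stream holds through the hash one line at a time.
        my %matched;
        open my $hfh, '<', 'holds' or die "holds: $!";
        while (<$hfh>) {
            chomp;
            my ($key) = split /\|/, $_, 2;
            if ($copies{$key}) {
                print "hold and copy ($key)\n";
                $matched{$key} = 1;
            }
            else {
                print "hold ($key)\n";
            }
        }
        close $hfh;

        # Whatever was hashed but never probed successfully is copy-only.
        for my $key (sort keys %copies) {
            print "copy ($key)\n" unless $matched{$key};
        }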

        Hmm, does this represent everything the program needs to do? Because if so, I note that you only ever use the first field of each line - you can load the first field of all the lines in both files into memory and avoid the painfully slow re-reading; something like (untested):

        my $hold = readfile('holds');
        my $copy = readfile('copies');
        for my $key (keys %$hold) {
            if ($copy->{$key}) {
                print "$key: hold and copy (or copy and hold)\n";
            }
        }

        sub readfile {
            my $file = shift;
            my $hash = {};
            open(my $fh, "<$file") or die "$file: $!";
            local $_;
            while (<$fh>) {
                # fields are '|' delimited - pick up the first field
                my $key = substr $_, 0, index($_, "|");
                ++$hash->{$key};
            }
            close $fh;
            return $hash;
        }

        Even if this is only the starting point, and the real code needs to access all the fields, you could for example cache in memory the first field and the offset into the file for each row, and then use seek() to locate the complete record whenever you need it.
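        A minimal sketch of that offset-caching idea, assuming the same pipe-delimited files; the key '12345' below is purely a hypothetical lookup:

        use strict;
        use warnings;

        # Index a file: map each first-field key to the byte offsets of its lines.
        sub index_file {
            my ($file) = @_;
            my %offset;
            open my $fh, '<', $file or die "$file: $!";
            while (1) {
                my $pos  = tell $fh;             # offset of the line about to be read
                my $line = <$fh>;
                last unless defined $line;
                my ($key) = split /\|/, $line, 2;
                push @{ $offset{$key} }, $pos;   # keys may repeat, so keep a list
            }
            return ($fh, \%offset);
        }

        my ($fh, $offsets) = index_file('copies');

        # Later, seek() back to read the complete record(s) only when needed.
        for my $pos (@{ $offsets->{'12345'} || [] }) {
            seek $fh, $pos, 0 or die "seek: $!";
            print scalar <$fh>;
        }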

        Hugo

        Something like the code below reduces it to one pass. It assumes that the two files are both pre-sorted on the key field.
        The idea is to maintain a buffer containing a window of all the adjacent lines in the second file that have the same current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes). Then read the first file one line at a time and get its key. If the key is less than the current key for the buffer, print the line; if it's greater, print the accumulated lines from the second file and refill the buffer. If they're the same, print the current line from file 1 with each of the lines in the buffer. The code below doesn't actually work yet; it needs more work to ensure that the buffer is flushed at the right times, etc., and doesn't handle EOFs correctly. But I'm supposed to be working rather than messing about on PerlMonks...

        #!/usr/bin/perl -w
        use strict;

        # Two pre-sorted, space-delimited input files, 'a' and 'b'.
        open my $f1, '<', 'a' or die "a: $!";
        open my $f2, '<', 'b' or die "b: $!";

        my ($key2, @rest2, $nkey2, $nrest2);

        # read in next N lines from f2 that have the same key
        sub get_next_block {
            @rest2 = ();
            while (1) {
                if (defined $nkey2) {
                    push @rest2, $nrest2;
                    $key2 = $nkey2;
                }
                my $line2 = <$f2>;
                return 0 unless defined $line2;
                ($nkey2, $nrest2) = split / /, $line2;
                chomp $nrest2;
                last if defined $key2 && $nkey2 ne $key2;
            }
        }
        get_next_block();

        OUTER:
        while (defined(my $line1 = <$f1>)) {
            my ($key1, $rest1) = split / /, $line1;
            chomp $rest1;
            if ($key1 gt $key2) {
                print "$key2 $_\n" for @rest2;
                get_next_block();
                next;
            }
            if ($key1 lt $key2) {
                print $line1;
                next;
            }
            print "$key1 $rest1 $_\n" for @rest2;
        }
        print while <$f1>;
        print while <$f2>;