Re: Re: Re: Re: many to many join on text files
by tilly (Archbishop) on Apr 15, 2004 at 02:02 UTC
Here is a more scalable solution that does what yours does and should handle large amounts of data (using some disk). If your data fits comfortably in memory, you may get away with passing undef in place of each filename, which makes DB_File keep the btree in RAM.
#! /usr/bin/perl -w
use strict;
use DB_File;

# $DB_BTREE and the R_* constants are exported by DB_File.
# Allow the btrees to have multiple entries per key.
$DB_BTREE->{flags} = R_DUP;

# DB_File wants you to create these with a tie, so I will even though
# I'm ignoring the tied hash.
unlink("holds.dbm"); # In case it is there
my $btree_holds = tie my %hash_holds, 'DB_File', "holds.dbm",
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'holds.dbm': $!";

unlink("copies.dbm"); # In case it is there
my $btree_copies = tie my %hash_copies, 'DB_File', "copies.dbm",
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create btree 'copies.dbm': $!";

open(COPIES, "<copies") or die "Can't open 'copies': $!";
while (<COPIES>) {
    chomp;
    my ($lookup) = split /\|/, $_;
    $btree_copies->put($lookup, $_);
}

open(HOLDS, "<holds") or die "Can't open 'holds': $!";
while (<HOLDS>) {
    chomp(my $value = $_);
    my ($lookup) = split /\|/, $value;
    $btree_holds->put($lookup, $value);
    my @copies = $btree_copies->get_dup($lookup);
    if (@copies) {
        print "hold and copy for $lookup\n" for @copies;
    }
    else {
        print "hold for $lookup\n";
    }
}

# Walk copies using the tree. Note that the API is somewhat obscure...
for (
    my $status = $btree_copies->seq(my $lookup, my $value, R_FIRST);
    0 == $status;
    $status = $btree_copies->seq($lookup, $value, R_NEXT)
) {
    my @holds = $btree_holds->get_dup($lookup);
    if (@holds) {
        print "copy and hold for $lookup\n" for @holds;
    }
    else {
        print "copy for $lookup\n";
    }
}

END {
    $btree_holds = $btree_copies = undef;
    untie %hash_holds;
    untie %hash_copies;
    unlink($_) for 'holds.dbm', 'copies.dbm';
}
Note that for any real use you probably don't want to print both "hold and copy" and "copy and hold", since they are synonymous.
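For the in-RAM variant mentioned at the top, the only change needed is the tie: DB_File documents that an undef filename creates a temporary in-memory database. A minimal sketch (same API as above, nothing to unlink afterwards):
#! /usr/bin/perl -w
use strict;
use DB_File;

$DB_BTREE->{flags} = R_DUP;

# undef in place of a filename => DB_File builds the btree in memory
my $btree_copies = tie my %hash_copies, 'DB_File', undef,
    O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot create in-memory btree: $!";

$btree_copies->put("abc", "abc|field1|field2");
$btree_copies->put("abc", "abc|field3|field4");   # duplicate key is kept
print "$_\n" for $btree_copies->get_dup("abc");   # prints both records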
Re: Re: Re: Re: many to many join on text files
by etcshadow (Priest) on Apr 15, 2004 at 03:35 UTC
Egad, man, that's awful! To put it in database terminology, you're doing a double-nested-loops outer join... over a full table scan! To quote the Simpsons: "That's bad".
What you should be doing for a large equi-join like this is a method called a "merge join". The concept is: sort both files first, on the columns you wish to join by, then open up each file and advance through the two in step. Think of a zipper.
Here's some rough code based on yours (bear in mind that I'm not being super-perfect with this, particularly the initial sort... I'm trying to demonstrate an algorithm):
# assuming this is *nix, or something with a sort utility,
# otherwise this can be done directly in perl
system("sort holds > tmpholds") and die "sort holds failed";
system("sort copies > tmpcopies") and die "sort copies failed";

open(HOLDS, "<tmpholds") or die "tmpholds: $!";
my (@holds, $holdseof);
sub readhold {
    defined($_ = <HOLDS>) or return $holdseof++;
    chomp;
    @holds = split /\|/, $_, -1;
}
readhold;

open(COPIES, "<tmpcopies") or die "tmpcopies: $!";
my (@copies, $copieseof);
sub readcopy {
    defined($_ = <COPIES>) or return $copieseof++;
    chomp;
    @copies = split /\|/, $_, -1;
}
readcopy;

while (!($holdseof && $copieseof)) {
    if ($holdseof || (!$copieseof && $holds[0] gt $copies[0])) {
        print "copy ($copies[0])\n";
        readcopy;
    }
    elsif ($copieseof || $copies[0] gt $holds[0]) {
        print "hold ($holds[0])\n";
        readhold;
    }
    else {
        print "hold and copy ($holds[0])\n";
        readhold;
        readcopy;
    }
}
close HOLDS;
close COPIES;
__END__
holds
------
iiiii
asdf
fdd
dsafe
dsaf
bfer
rewtewt
bfret
zzzzzzzzz
copies
------
weewr
dddddd
rewtewt
bfret
fdfdsfsdfdsa
dsafe
dsaf
asdf
fdd
bfer
output
------
hold and copy (asdf)
hold and copy (bfer)
hold and copy (bfret)
copy (dddddd)
hold and copy (dsaf)
hold and copy (dsafe)
hold and copy (fdd)
copy (fdfdsfsdfdsa)
hold (iiiii)
hold and copy (rewtewt)
copy (weewr)
hold (zzzzzzzzz)
------------
:Wq
Not an editor command: Wq
FYI: this is how a database would do it (on a large table... a small table might be done via a hash join, which, if implemented in perl, is exactly what it sounds like).
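For comparison, a rough sketch of such a hash join on the same holds/copies files (assuming copies is the smaller file and fits in memory; duplicate keys are kept so the many-to-many case still works; to stay short it doesn't report copy-only rows, so it's a left join rather than a full outer join):
#!/usr/bin/perl -w
use strict;

# build: hash the smaller file on its join key, keeping duplicates
open(COPIES, "<copies") or die "copies: $!";
my %copies;
while (<COPIES>) {
    chomp;
    my ($key) = split /\|/;
    push @{ $copies{$key} }, $_;
}
close COPIES;

# probe: stream the larger file past the hash
open(HOLDS, "<holds") or die "holds: $!";
while (<HOLDS>) {
    chomp;
    my ($key) = split /\|/;
    if ($copies{$key}) {
        print "hold and copy ($key)\n" for @{ $copies{$key} };
    }
    else {
        print "hold ($key)\n";
    }
}
close HOLDS;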
|
As his original question suggested, and Re: Re: Re: Re: Re: Re: many to many join on text files clarified, the actual dataset is a many-to-many join, and where each side is many he wants every combination to be represented. That can be done by walking the two files in parallel, but it requires some backtracking logic that can be tricky to get right.
The BTREE solution that I gave is very similar to pre-sorting both files and walking them in parallel. In particular, a BTREE is an ordered structure which is not completely filled, but is close to it. The details are all handled by DB_File at the C level, and should be reasonably efficient.
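To illustrate the tricky part mentioned above, a minimal sketch of what the parallel walk has to emit for one key that is duplicated on both sides (hypothetical rows):
# one duplicated key on both sides: every hold pairs with every copy
my @hold_group = ('k|h1', 'k|h2');        # hypothetical rows for key 'k'
my @copy_group = ('k|c1', 'k|c2', 'k|c3');
for my $hold (@hold_group) {
    for my $copy (@copy_group) {
        print "$hold matches $copy\n";    # 2 x 3 = 6 output rows
    }
}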
Re: Re: Re: Re: many to many join on text files
by hv (Prior) on Apr 15, 2004 at 01:39 UTC
Hmm, does this represent everything the program needs to do? Because if so, I note that you only ever use the first field of each line - you can load the first field of all the lines in both files into memory, and avoid the painfully slow re-reading; something like (untested):
my $hold = readfile('holds');
my $copy = readfile('copies');
for my $key (keys %$hold) {
    if ($copy->{$key}) {
        print "$key: hold and copy (or copy and hold)\n";
    }
}

sub readfile {
    my $file = shift;
    my $hash = {};
    open(my $fh, "<$file") or die "$file: $!";
    local $_;
    while (<$fh>) {
        # fields are '|' delimited - pick up the first field
        my $key = substr $_, 0, index($_, "|");
        ++$hash->{$key};
    }
    close $fh;
    return $hash;
}
Even if this is only the starting point, and the real code needs to access all the fields, you could for example cache in memory the first field and the offset into the file for each row, and then use seek() to locate the complete record whenever you need it.
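Something along these lines (also untested; index_file and fetch_record are hypothetical helpers for the offset-caching idea just described):
# cache file offsets per key (an arrayref per key allows duplicates),
# then pull full records back on demand with seek()
sub index_file {
    my $file = shift;
    my $hash = {};
    open(my $fh, "<$file") or die "$file: $!";
    local $_;
    while (1) {
        my $pos = tell $fh;      # offset of the row about to be read
        last unless defined($_ = <$fh>);
        my $key = substr $_, 0, index($_, "|");
        push @{ $hash->{$key} }, $pos;
    }
    return ($hash, $fh);
}

sub fetch_record {
    my ($fh, $pos) = @_;
    seek($fh, $pos, 0) or die "seek: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}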
Hugo
|
|
|
Well, it doesn't have to lose any data you need to keep - I originally wrote it as $hash->{$key} = 1, and changed it to an increment to preserve the count; it could just as easily store an arrayref of line numbers or file positions for each key. The core concept, though, is that the primary super-slow operation is searching through a file repeatedly for matches on a key; if there is any way to turn that into a hash lookup, it's likely to improve things as long as there is memory left.
I think we'd need to see some information about the actual task at hand to know whether there's some reason an in-core hash cannot be used, but unless memory is unusually restricted for a modern-day computer it seems unlikely.
Hugo
|
The nitty gritty of the processing after the full outer join is trivial compared to the join itself. Both tables may also contain duplicate rows... which need to match up in full outer join fashion - this breaks the logic of my program to do a proper full outer join.
I've just tried M$Access... can't find a full outer join there... it does have a nice text import wizard though, so columns automatically get field names.
Will try mysql quickly next.
Re: Re: Re: Re: many to many join on text files
by dave_the_m (Monsignor) on Apr 15, 2004 at 13:37 UTC
Something like the code below reduces it to one pass. It assumes that the two files are both pre-sorted on the key field.
The idea is to maintain a buffer containing a window of all the adjacent lines in the second file that share the current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes). Then read the first file one line at a time and get its key. If the key is less than the current key for the buffer, print the line; if it's greater, print the accumulated lines from the second file (unless they already matched) and refill the buffer; if they're the same, print the current line from file 1 with each of the lines in the buffer. The fiddly parts are making sure the buffer is flushed at the right times and that EOF on either file is handled correctly. But I'm supposed to be working rather than messing about on perlMonks...
#!/usr/bin/perl -w
use strict;

open my $f1, "<", "a" or die "a: $!";
open my $f2, "<", "b" or die "b: $!";

my ($key2, @rest2, $nkey2, $nrest2, $eof2, $matched2);

# read in the next N lines from f2 that have the same key; the first
# line of the following block stays pending in ($nkey2, $nrest2)
sub get_next_block {
    @rest2 = ();
    $matched2 = 0;
    while (1) {
        if (defined $nkey2) {
            push @rest2, $nrest2;
            $key2 = $nkey2;
        }
        my $line2 = <$f2>;
        unless (defined $line2) {
            $nkey2 = undef;
            $eof2 = 1 unless @rest2;  # nothing buffered: f2 is exhausted
            return;
        }
        chomp $line2;
        ($nkey2, $nrest2) = split / /, $line2, 2;
        last if defined $key2 && @rest2 && $nkey2 ne $key2;
    }
}

get_next_block();
while (defined(my $line1 = <$f1>)) {
    chomp $line1;
    my ($key1, $rest1) = split / /, $line1, 2;
    # flush (and print, if never matched) any f2 blocks whose key
    # sorts before this line's key
    while (!$eof2 && $key2 lt $key1) {
        unless ($matched2) {
            print "$key2 $_\n" for @rest2;
        }
        get_next_block();
    }
    if ($eof2 || $key1 lt $key2) {
        print "$key1 $rest1\n";               # unmatched line from file 1
    }
    else {
        # equal keys: this line pairs with every line in the buffer
        print "$key1 $rest1 $_\n" for @rest2;
        $matched2 = 1;
    }
}
# file 1 is done: flush any remaining unmatched blocks from f2
until ($eof2) {
    unless ($matched2) {
        print "$key2 $_\n" for @rest2;
    }
    get_next_block();
}