Something like the code below reduces it to one pass. It assumes that both files are pre-sorted on the key field (in string order, since the keys are compared with lt and gt).
The idea is to maintain a buffer holding a window of all the adjacent lines in the second file that share the current key. As the key increases, the current buffer is thrown away and the next chunk of lines is read in (stopping when the key changes).

Then read the first file one line at a time and get its key. If the key is less than the buffer's current key, print the line. If it's greater, print the accumulated lines from the second file and refill the buffer. If they're the same, print the current line from file 1 joined with each of the lines in the buffer.

The code below doesn't actually work yet: it needs more work to ensure that the buffer is flushed at the right times, and it doesn't handle EOFs correctly. But I'm supposed to be working rather than messing about on PerlMonks...

    #!/usr/bin/perl -w
    use strict;

    open my $f1, 'a';
    open my $f2, 'b';

    my ($key2, @rest2, $nkey2, $nrest2);

    # read in next N lines from f2 that have the same key
    sub get_next_block {
        @rest2 = ();
        while (1) {
            if (defined $nkey2) {
                push @rest2, $nrest2;
                $key2 = $nkey2;
            }
            my $line2 = <$f2>;
            return 0 unless defined $line2;
            ($nkey2, $nrest2) = split / /, $line2;
            chomp $nrest2;
            last if defined $key2 && $nkey2 ne $key2;
        }
    }

    get_next_block();

    OUTER:
    while (defined (my $line1 = <$f1>)) {
        my ($key1, $rest1) = split / /, $line1;
        chomp $rest1;
        if ($key1 gt $key2) {
            print "$key2 $_\n" for @rest2;
            get_next_block();
            next;
        }
        if ($key1 lt $key2) {
            print $line1;
            next;
        }
        print "$key1 $rest1 $_\n" for @rest2;
    }

    print while (<$f1>);
    print while (<$f2>);
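For what it's worth, here is one way the flushing and EOF handling might be finished off. Treat it as a rough sketch rather than the definitive version. The names parse_line, block_reader and merge_join are made up for this example. It assumes every line looks like "key rest" with a single space after the key, that both inputs are sorted on the key in string order, and that the intended result is an outer join: matched lines are paired up, and unmatched lines from either file are printed on their own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split "key rest" on the first space only, so the rest may contain spaces.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($key, $rest) = split / /, $line, 2;
    return ($key, defined $rest ? $rest : '');
}

# Return a closure that yields one block per call: [$key, \@rests], holding
# all adjacent lines that share a key. Yields undef once the file is exhausted.
sub block_reader {
    my ($fh) = @_;
    my $pending = <$fh>;    # one-line lookahead
    return sub {
        return unless defined $pending;
        my ($key, $rest) = parse_line($pending);
        my @rests = ($rest);
        while (defined($pending = <$fh>)) {
            my ($k, $r) = parse_line($pending);
            last if $k ne $key;    # $pending now starts the next block
            push @rests, $r;
        }
        return [$key, \@rests];
    };
}

# One-pass join of two key-sorted filehandles, written to $out.
sub merge_join {
    my ($fh1, $fh2, $out) = @_;
    my $next2 = block_reader($fh2);
    my $block = $next2->();

    # Print a file 2 block on its own, but only if nothing matched it.
    my $flush = sub {
        my ($b) = @_;
        return if $b->[2];
        print {$out} "$b->[0] $_\n" for @{ $b->[1] };
    };

    while (defined(my $line1 = <$fh1>)) {
        my ($key1, $rest1) = parse_line($line1);

        # Skip past file 2 blocks whose key is already behind us.
        while ($block && $block->[0] lt $key1) {
            $flush->($block);
            $block = $next2->();
        }

        if ($block && $block->[0] eq $key1) {
            print {$out} "$key1 $rest1 $_\n" for @{ $block->[1] };
            $block->[2] = 1;    # remember that this block was matched
        }
        else {
            print {$out} "$key1 $rest1\n";    # no partner in file 2
        }
    }

    # File 1 is exhausted: drain whatever is left of file 2.
    while ($block) {
        $flush->($block);
        $block = $next2->();
    }
}
```

To run it against the two files from the original code, open 'a' and 'b' for reading and call merge_join($f1, $f2, \*STDOUT). The buffer is kept until a larger key turns up in file 1, so several file 1 lines with the same key each get joined against the same block, and the "matched" flag stops a block from being printed a second time when it is finally discarded.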