in reply to Combining Files

Update: After talking with ImpalaSS, I modified the code to handle multiple entries in the first file that share the same matching key.


Quick solution. Not the best way, as it is not very flexible, but it gets the job done. It takes the two files to be joined as the first and second arguments, and the file to be created as the third. It prints any "unmatched" records.

#!/dev/null/perl -- :)
use strict;

my $first  = shift or die "Need the first file!\n";
my $second = shift or die "Need the second file!\n";
my $third  = shift or die "Need the third file!\n";

## May as well test them all now
open(FIRST,  '<', $first)  or die "Could not open $first: $!\n";
open(SECOND, '<', $second) or die "Could not open $second: $!\n";
open(THIRD,  '>', $third)  or die "Could not write $third: $!\n";

## %first: key => list of column arrays (one per matching line)
## %found: key => line number, for keys not yet seen in the second file
my (%first, %found);
while(<FIRST>) {
  chomp;  ## Keep the newline out of the last column
  my ($key, @cols) = split(m#\|# => $_, -1);
  push(@{$first{$key}}, \@cols);
  $found{$key} = $.;
}
close(FIRST);

while(<SECOND>) {
  chomp;
  my ($one, $key, @cols) = split(m#\|# => $_, -1);
  if (exists $first{$key}) {
    delete $found{$key};
    ## One output line per first-file entry with this key
    for (@{$first{$key}}) {
      print THIRD "$_->[0]|$one|$cols[0]\n";
    }
  }
  else {
    printf "Line %5d: Record %5d exists in $second but not in $first\n", $.,$key;
  }
}
close(SECOND);
close(THIRD);

## Double check the first file:
for (sort {$found{$a} <=> $found{$b}} keys %found) {
  printf "Line %5d: Record %5d exists in $first but not in $second\n", $found{$_},$_;
}

Re: Re: Combining Files
by lemming (Priest) on Jan 24, 2001 at 00:02 UTC

    Very similar to code I wrote for a similar project. I found that with a huge amount of data it would bog the machine down. The way I found around this was to work with sorted data (using the Unix sort command) and to buffer the input: read many lines from the first file, then go to the second file and output all the matching lines, keep reading the second file until the buffer fills, then go back to the first, and so on.

    Note, I was working with data I knew and I was able to tune the buffering for the machine it ran on.
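
    For anyone who wants to try the sorted approach, here is a rough sketch of a plain merge join. It is not the tuned buffering described above: it simply walks both files in step and only ever holds the first-file lines for one key in memory. It assumes the same layout as the original script (key in the first field of the first file and in the second field of the second file) and that both files were sorted on that key in byte order beforehand, for example with LC_ALL=C sort -t'|' -k1,1 and -k2,2. The name merge_join and the choice to print to STDOUT are illustrative only, and the "unmatched" reporting is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: join two key-sorted, pipe-delimited streams.
# $fh1 lines look like  key|col|...     (sorted by field 1)
# $fh2 lines look like  one|key|col|... (sorted by field 2)
# Writes "first-col|one|second-col" lines to $out, like the original script.
sub merge_join {
    my ($fh1, $fh2, $out) = @_;

    # Read the next non-blank record from the first file.
    # Returns (key, first column), or an empty list at end of file.
    my $next_first = sub {
        while (defined(my $line = <$fh1>)) {
            chomp $line;
            next unless length $line;
            my ($key, @cols) = split /\|/, $line, -1;
            return ($key, $cols[0] // '');
        }
        return;
    };

    my ($key1, $val1) = $next_first->();
    my ($group_key, @group);    # first-file values for the current key

    while (defined(my $line = <$fh2>)) {
        chomp $line;
        my ($one, $key2, @cols) = split /\|/, $line, -1;
        next unless defined $key2;    # skip lines without a key field

        # Rebuild the group only when the second file moves to a new key.
        unless (defined $group_key and $group_key eq $key2) {
            ($group_key, @group) = ($key2);
            # Skip first-file records whose key sorts before this one.
            ($key1, $val1) = $next_first->()
                while defined $key1 and $key1 lt $key2;
            # Collect every first-file record that shares this key.
            while (defined $key1 and $key1 eq $key2) {
                push @group, $val1;
                ($key1, $val1) = $next_first->();
            }
        }
        my $col = $cols[0] // '';
        print {$out} "$_|$one|$col\n" for @group;
    }
}

# Run as a script: perl merge_join.pl first.sorted second.sorted
if (@ARGV == 2) {
    open(my $fh1, '<', $ARGV[0]) or die "Could not open $ARGV[0]: $!\n";
    open(my $fh2, '<', $ARGV[1]) or die "Could not open $ARGV[1]: $!\n";
    merge_join($fh1, $fh2, \*STDOUT);
}
```

    The keys are compared as strings (lt and eq), so the sort order has to be byte order as well. If the files were sorted numerically, the comparisons would need to change to < and ==.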