in reply to Help creating HASH for file comparison
It seems like you're using two criteria where one is sufficient. The SEQUENCE_NUMBER may just be a distraction; all you really want to know is if the first ten fields are identical or not, and if not, do some further processing.
Also, you mention that there may be a million lines per file that you're comparing. Is it a million? Or could it be millions? Are you concerned about swamping physical RAM? Or is that not an issue?
The following solution takes an MD5 digest of the first ten fields of each line in the first file and stores the line number and the actual line in a hash tied to MLDBM. The MD5 digest is the hash key, and the line number and actual line are held together in an array ref. This is repeated for each line of the first file.
Then one pass is taken through the second file. For each line of the second file, an MD5 digest is generated from the first ten fields in the same way. If that digest is found as a key in our tied hash, we know (with a high degree of confidence) that we have a collision. You might perform additional processing at that point, but all I did was print both lines and both line numbers.
I altered your input data set by copying one line from file two into file one, just to test. Here's the code I used:
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );
use MLDBM;
use Fcntl;

tie my %o, 'MLDBM', 'tempdb', O_CREAT|O_RDWR, 0640 or die $!;

# First file: populate the hash.
process_file( 'infile1.txt', \%o, sub {
    $_[0]->{ $_[1] } = [ $., $_[2] ];
} );

# Second file: check for collisions.
process_file( 'infile2.txt', \%o, sub {
    my( $tied, $hash, $line ) = @_;
    print "\nCollision: infile1.txt line $tied->{$hash}->[0]:\n",
          "\t($tied->{$hash}->[1])\n",
          "-- collides with: infile2.txt line $.:\n",
          "\t($line)\n\n"
        if exists $tied->{$hash};
} );

END { untie %o; unlink glob 'tempdb.*'; }    # RAII.

sub process_file {
    my( $filename, $tied, $code ) = @_;
    open my $infh, '<', $filename or die $!;
    while( my $line = <$infh> ) {
        my ( $wanted ) = $line =~ m/((?:[^,]*,){10})/;
        next unless defined $wanted;
        chomp $line;
        my $hash = md5_hex( $wanted );
        $code->( $tied, $hash, $line );
    }
    close $infh;
}
A pure hash (as opposed to one tied to a database) would be more time-efficient, but would not scale well if your file sizes climb to millions of lines. This approach will scale fairly well.
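For modest file sizes, the in-memory variant is just the same keying scheme with a plain Perl hash in place of the MLDBM tie. Here is a minimal sketch; the inline sample lines are made up purely for illustration, and only the first ten comma-separated fields participate in the digest:

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

my %seen;    # md5 of first ten fields => [ line number, line ]

# Stand-in for reading "file 1" (sample data, not real input).
my @file1 = (
    "a,b,c,d,e,f,g,h,i,j,SEQ1,rest\n",
    "1,2,3,4,5,6,7,8,9,10,SEQ2,rest\n",
);
for my $i ( 0 .. $#file1 ) {
    my ($key) = $file1[$i] =~ m/((?:[^,]*,){10})/ or next;
    $seen{ md5_hex($key) } = [ $i + 1, $file1[$i] ];
}

# Stand-in for scanning "file 2" for collisions on the first ten fields.
my @file2 = ( "1,2,3,4,5,6,7,8,9,10,SEQ9,other\n" );
my @collisions;    # [ file1 line number, file2 line number ]
for my $i ( 0 .. $#file2 ) {
    my ($key) = $file2[$i] =~ m/((?:[^,]*,){10})/ or next;
    push @collisions, [ $seen{ md5_hex($key) }[0], $i + 1 ]
        if exists $seen{ md5_hex($key) };
}
```

The trade-off is exactly the one described above: the plain hash avoids the disk round-trips of the tied version, but its memory footprint grows with the first file's line count.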
A small optimization might be to go back to using the SEQUENCE_NUMBER as a hash key, and then, if a partial collision is found, hash the first ten elements of that line from file 1 and file 2 to detect a full collision. But what happens if a single sequence number occurs more than once in a given file? (The first approach would be tolerant of that, whereas the second wouldn't.) Either approach assumes that there are no "full" collisions within a single file.

Dave
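That pre-filter idea can be sketched as follows. This assumes the SEQUENCE_NUMBER is the eleventh field; the sample lines are invented for illustration. Note the last-one-wins overwrite in `%by_seq`, which is exactly the duplicate-sequence-number weakness mentioned above:

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

my %by_seq;    # sequence number => md5 of first ten fields (file 1)

# Stand-in for "file 1" (sample data).
my @file1 = ( "a,b,c,d,e,f,g,h,i,j,100,rest\n" );
for my $line (@file1) {
    my ( $ten, $seq ) = $line =~ m/^((?:[^,]*,){10})([^,]*)/ or next;
    $by_seq{$seq} = md5_hex($ten);    # last one wins if $seq repeats!
}

# Stand-in for "file 2": same sequence number, differing first ten fields
# on the second line, so only the first line is a full collision.
my @file2 = (
    "a,b,c,d,e,f,g,h,i,j,100,other\n",
    "x,x,x,x,x,x,x,x,x,x,100,other\n",
);
my @full;
for my $line (@file2) {
    my ( $ten, $seq ) = $line =~ m/^((?:[^,]*,){10})([^,]*)/ or next;
    next unless exists $by_seq{$seq};                      # partial collision
    push @full, $seq if $by_seq{$seq} eq md5_hex($ten);    # full collision
}
```

Only lines whose sequence numbers match ever get hashed, so this does less MD5 work, at the cost of the duplicate-sequence-number caveat.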