julio_514 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, After a long time, I seek your wisdom once again!

I have two big input files of about 1.5 million rows each (each row consists of a unique ID number). Let's call them file#1 and file#2. The vast majority of these ID numbers occur once in BOTH files, and some ID numbers are exclusive to either file#1 or file#2.

What I want to do is produce three output files: 1) one file with IDs occurring in both input files, 2) one file containing IDs unique to file#1 and 3) one file containing IDs unique to file#2.

I mean... I know how to write the code to do this kind of stuff, but it's taking forever the way I wrote it. It's the first time I have to deal with huge files like this... Does anyone have a suggestion for an efficient implementation of this kind of task?

Replies are listed 'Best First'.
Re: Help for finding duplicates in huge files
by roboticus (Chancellor) on Jul 16, 2011 at 02:45 UTC

    julio_514:

    A simple variation of a merge sort would do the trick. I posted a node a while back with code to do it. You'd just need to change the merge step so that it compares the values and writes each one to the appropriate output file. A brief (untested) example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($infile1, $infile2)             = ('abc', 'def');
    my ($outfile1, $outfile2, $outfile3) = ('out.both', 'out.1', 'out.2');

    # Let the external sort do the heavy lifting, then merge the two sorted streams.
    open my $IF1, "sort $infile1 |" or die "opening $infile1: $!";
    open my $IF2, "sort $infile2 |" or die "opening $infile2: $!";
    open my $OF1, '>', $outfile1 or die "opening $outfile1: $!";
    open my $OF2, '>', $outfile2 or die "opening $outfile2: $!";
    open my $OF3, '>', $outfile3 or die "opening $outfile3: $!";

    my $f1 = <$IF1>;
    my $f2 = <$IF2>;
    while (defined($f1) or defined($f2)) {
        if (!defined($f2) or (defined($f1) and $f1 lt $f2)) {
            print $OF2 $f1;     # ID only in file 1
            $f1 = <$IF1>;
        }
        elsif (!defined($f1) or $f2 lt $f1) {
            print $OF3 $f2;     # ID only in file 2
            $f2 = <$IF2>;
        }
        else {
            print $OF1 $f1;     # ID in both files
            $f1 = <$IF1>;
            $f2 = <$IF2>;
        }
    }

    Also, if you plan on doing many operations like this, you may just want to dump your data in a database, and let the database do the work.
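
    If you go that route, a minimal (untested) sketch along these lines might do, assuming DBD::SQLite, IDs that are unique within each file, and made-up file, table and output names:

    #!/usr/bin/perl
    # Untested sketch: load both ID lists into SQLite and let SQL do the set logic.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=ids.db', '', '', { RaiseError => 1 });

    $dbh->do("CREATE TABLE $_ (id TEXT PRIMARY KEY)") for qw(ids1 ids2);

    # Bulk-load the IDs inside one transaction per file for speed.
    for my $pair ([ 'file1.txt', 'ids1' ], [ 'file2.txt', 'ids2' ]) {
        my ($file, $table) = @$pair;
        open my $fh, '<', $file or die "opening $file: $!";
        my $ins = $dbh->prepare("INSERT INTO $table (id) VALUES (?)");
        $dbh->begin_work;
        while (<$fh>) {
            chomp;
            $ins->execute($_);
        }
        $dbh->commit;
    }

    # INTERSECT / EXCEPT give exactly the three result sets the OP wants.
    my %query = (
        'out.both' => 'SELECT id FROM ids1 INTERSECT SELECT id FROM ids2',
        'out.1'    => 'SELECT id FROM ids1 EXCEPT SELECT id FROM ids2',
        'out.2'    => 'SELECT id FROM ids2 EXCEPT SELECT id FROM ids1',
    );
    for my $outfile (sort keys %query) {
        open my $out, '>', $outfile or die "opening $outfile: $!";
        my $sth = $dbh->prepare($query{$outfile});
        $sth->execute;
        while (my ($id) = $sth->fetchrow_array) {
            print {$out} "$id\n";
        }
    }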

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Help for finding duplicates in huge files
by Marshall (Canon) on Jul 16, 2011 at 04:07 UTC
    With only about 3 million unique IDs, a memory-resident hash table is completely feasible. An easy approach is to read all of the input data twice and output a subset of the input data once.

    my %hash;

    open file1 ...
    while (<$file1>) {
        ... do something to get the id
        $hash{$id}++;
    }
    close file1

    open file2 ...
    while (<$file2>) {
        ... do something to get the id
        $hash{$id} += 3;
    }
    close file2

    open 3 output files: the "both" file, file1 only and file2 only
    loop back thru file1 and file2
    for each line, decide where it goes:
        if hash value of the id == 1 -> file one only
        if hash value of the id == 3 -> file two only
        if hash value of the id == 4 -> both files
        negate the hash value to signal that this id has already been dealt
        with - each id should only appear in one of the 3 output files
    Update:
    To just get the IDs of the lines into the 3 different files, it is not necessary to read the input files a second time - just dump the hash according to the "rules". In the above, I first thought that each id would be associated with some non-trivial amount of data - that's not true. The above algorithm will run very fast - no sorting/merging is required.
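
    Something along these lines (an untested sketch; the file names are just placeholders) would be the hash-only version:

    #!/usr/bin/perl
    # Untested sketch of the counting-hash approach described above.
    use strict;
    use warnings;

    my %hash;

    open my $file1, '<', 'file1.txt' or die "opening file1.txt: $!";
    while (<$file1>) {
        chomp;
        $hash{$_}++;        # an ID seen in file 1 contributes 1
    }
    close $file1;

    open my $file2, '<', 'file2.txt' or die "opening file2.txt: $!";
    while (<$file2>) {
        chomp;
        $hash{$_} += 3;     # an ID seen in file 2 contributes 3
    }
    close $file2;

    open my $both,  '>', 'out.both' or die $!;
    open my $only1, '>', 'out.1'    or die $!;
    open my $only2, '>', 'out.2'    or die $!;

    # Dump the hash according to the rules:
    # 1 => file 1 only, 3 => file 2 only, 4 => both files.
    while (my ($id, $count) = each %hash) {
        if    ($count == 1) { print $only1 "$id\n" }
        elsif ($count == 3) { print $only2 "$id\n" }
        elsif ($count == 4) { print $both  "$id\n" }
    }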
Re: Help for finding duplicates in huge files
by CountZero (Bishop) on Jul 16, 2011 at 10:24 UTC
    Actually it goes quite fast.

    I first generated your two 1.5-million-row files to test with. Each ID has a length of 38 characters and is guaranteed to be unique within each file. Some IDs, however, can be found in both files.

    The following program reads both files and produces three arrays: @both (containing the IDs which were found in both files), @first and @second (IDs only found in one of both files).

    use Modern::Perl;

    # Read the first file into a hash: the keys are the IDs.
    my %first_file;
    {
        open my $FIRST, '<', 'first.txt' or die $!;
        while (<$FIRST>) {
            chomp;
            $first_file{$_} = '';
        }
    }

    # Check every ID of the second file against that hash.
    my (@both, @first, @second);
    {
        open my $SECOND, '<', 'second.txt' or die $!;
        while (<$SECOND>) {
            chomp;
            if (exists $first_file{$_}) {
                push @both, $_;             # seen in both files
                delete $first_file{$_};     # so only file-1-only IDs remain
            }
            else {
                push @second, $_;           # only in the second file
            }
        }
    }

    # Whatever is left in the hash was never matched: unique to the first file.
    @first = keys %first_file;
    On my Windows XP Lenovo X200s laptop running Perl 5.12, it takes 5 seconds to read the first file into the hash and 10 seconds to check the second file against it and put the IDs into their respective arrays.
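
    If the three result files are wanted rather than the in-memory arrays, a small (untested) addition like this could follow the code above; the output file names are my own invention:

    # Write the three result arrays out to the files the OP asked for.
    for my $set ([ 'out.both', \@both ], [ 'out.1', \@first ], [ 'out.2', \@second ]) {
        my ($file, $ids) = @$set;
        open my $OUT, '>', $file or die "opening $file: $!";
        print {$OUT} map { "$_\n" } @$ids;
        close $OUT;
    }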

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Help for finding duplicates in huge files
by sundialsvc4 (Abbot) on Jul 16, 2011 at 11:46 UTC

    With memory sizes as they are today, 1.5 million rows is “just not that interesting.” If you had data volumes several times larger than that, you could do it like they did it before computers came along: sort both trays of punched cards into an identically sorted sequence, then run both trays at the same time through the merge machine.

    If you are using an SQL database, simply ORDER BY the same key. When you process n data streams that you know to be identically sorted, you simply select the smallest of the “current records” from each of the n streams until all streams are at end-of-file.

    External sorting algorithms are among the most heavily studied. Dr. Knuth titled one of his tomes Sorting and Searching for a very good reason.
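
    For what it's worth, a rough (untested) sketch of that n-stream merge, assuming the inputs are plain pre-sorted files passed on the command line:

    #!/usr/bin/perl
    # Untested sketch of the n-stream idea: repeatedly take the smallest
    # "current record" among several identically sorted streams.
    use strict;
    use warnings;

    # Assumes each input file is already sorted the same way
    # (e.g. by an external sort or an SQL ORDER BY).
    my @files   = @ARGV;
    my @handles = map { open my $fh, '<', $_ or die "opening $_: $!"; $fh } @files;
    my @current = map { scalar readline $_ } @handles;

    while (grep { defined } @current) {
        # Pick the stream whose current record sorts first.
        my $min;
        for my $i (0 .. $#current) {
            next unless defined $current[$i];
            $min = $i if !defined $min or $current[$i] lt $current[$min];
        }
        print $current[$min];                      # process the smallest record
        $current[$min] = readline $handles[$min];  # advance only that stream
    }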

Re: Help for finding duplicates in huge files
by Anonymous Monk on Jul 16, 2011 at 02:12 UTC
    Hi,

    You may already be doing it the most efficient way.

    Show us some code. Or at least tell us what you have tried.

    J.C.

Re: Help for finding duplicates in huge files
by julio_514 (Acolyte) on Jul 16, 2011 at 20:30 UTC

    Guys, thank you so much for your suggestions! Using hash tables works like a charm. I won't post my original script because I would be embarrassed :|

    I apologize, the post title is misleading - it is true that these are not huge files by today's standards.