julio_514 has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, After a long time, I seek your wisdom once again!

I have two big input files of about 1.5 million rows each (each row consists of a unique ID number). Let's call them file#1 and file#2. The vast majority of these ID numbers occur once in BOTH files, and some ID numbers are exclusive to either file#1 or file#2.

What I want to do is produce three output files: 1) one file with IDs occurring in both input files, 2) one file containing IDs unique to file#1 and 3) one file containing IDs unique to file#2.

I mean... I know how to write the code to do this kind of stuff, but it's taking forever the way I wrote it. It's the first time I have to deal with huge files like this... Does anyone have a suggestion for an efficient implementation of this kind of task?

Replies are listed 'Best First'.
Re: Help for finding duplicates in huge files
by roboticus (Chancellor) on Jul 16, 2011 at 02:45 UTC

    julio_514:

    A simple variation of a merge sort would do the trick. I posted a node a while back with code to do it. You'd just need to change the merge step so that it compares the values and writes each one to the appropriate output file. A brief (untested) example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my ($infile1, $infile2)             = ('abc', 'def');
    my ($outfile1, $outfile2, $outfile3) = ('out.both', 'out.1', 'out.2');

    # Let the external sort do the heavy lifting, then merge the two sorted streams.
    open my $IF1, "sort $infile1 |" or die "opening $infile1: $!";
    open my $IF2, "sort $infile2 |" or die "opening $infile2: $!";
    open my $OF1, '>', $outfile1 or die "opening $outfile1: $!";
    open my $OF2, '>', $outfile2 or die "opening $outfile2: $!";
    open my $OF3, '>', $outfile3 or die "opening $outfile3: $!";

    my $f1 = <$IF1>;
    my $f2 = <$IF2>;
    while (defined($f1) or defined($f2)) {
        if (!defined($f2) or (defined($f1) and $f1 lt $f2)) {
            print $OF2 $f1;     # ID only in file 1
            $f1 = <$IF1>;
        }
        elsif (!defined($f1) or $f2 lt $f1) {
            print $OF3 $f2;     # ID only in file 2
            $f2 = <$IF2>;
        }
        else {
            print $OF1 $f1;     # ID in both files
            $f1 = <$IF1>;
            $f2 = <$IF2>;
        }
    }

    Also, if you plan on doing many operations like this, you may just want to dump your data in a database, and let the database do the work.
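
    If you go that route, a minimal (untested) sketch along these lines might do, assuming DBD::SQLite, IDs that are unique within each file, and made-up file, table and output names:

    #!/usr/bin/perl
    # Untested sketch: load both ID lists into SQLite and let SQL do the set logic.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=ids.db', '', '', { RaiseError => 1 });

    $dbh->do("CREATE TABLE $_ (id TEXT PRIMARY KEY)") for qw(ids1 ids2);

    # Bulk-load the IDs inside one transaction per file for speed.
    for my $pair ([ 'file1.txt', 'ids1' ], [ 'file2.txt', 'ids2' ]) {
        my ($file, $table) = @$pair;
        open my $fh, '<', $file or die "opening $file: $!";
        my $ins = $dbh->prepare("INSERT INTO $table (id) VALUES (?)");
        $dbh->begin_work;
        while (<$fh>) {
            chomp;
            $ins->execute($_);
        }
        $dbh->commit;
    }

    # INTERSECT / EXCEPT give exactly the three result sets the OP wants.
    my %query = (
        'out.both' => 'SELECT id FROM ids1 INTERSECT SELECT id FROM ids2',
        'out.1'    => 'SELECT id FROM ids1 EXCEPT SELECT id FROM ids2',
        'out.2'    => 'SELECT id FROM ids2 EXCEPT SELECT id FROM ids1',
    );
    for my $outfile (sort keys %query) {
        open my $out, '>', $outfile or die "opening $outfile: $!";
        my $sth = $dbh->prepare($query{$outfile});
        $sth->execute;
        while (my ($id) = $sth->fetchrow_array) {
            print {$out} "$id\n";
        }
    }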

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Help for finding duplicates in huge files
by Marshall (Canon) on Jul 16, 2011 at 04:07 UTC
    With only about 3 million unique IDs, a memory-resident hash table is completely feasible. An easy approach is to read all of the input data twice and output a subset of the input data once.

    my %hash;

    open file1 ...
    while (<$file1>) {
        ... do something to get the id
        $hash{$id}++;
    }
    close file1

    open file2 ...
    while (<$file2>) {
        ... do something to get the id
        $hash{$id} += 3;
    }
    close file2

    open 3 output files: the "both" file, file1 only and file2 only
    loop back thru file1 and file2
    for each line, decide where it goes:
        if hash value of the id == 1 -> file one only
        if hash value of the id == 3 -> file two only
        if hash value of the id == 4 -> both files
        negate the hash value to signal that this id has already been dealt
        with - each id should only appear in one of the 3 output files
    Update:
    To just get the IDs of the lines into the 3 different files, it is not necessary to read the input files a second time - just dump the hash according to the "rules". In the above, I first thought that each id would be associated with some non-trivial amount of data - that's not true. The above algorithm will run very fast - no sorting/merging is required.
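
    Something along these lines (an untested sketch; the file names are just placeholders) would be the hash-only version:

    #!/usr/bin/perl
    # Untested sketch of the counting-hash approach described above.
    use strict;
    use warnings;

    my %hash;

    open my $file1, '<', 'file1.txt' or die "opening file1.txt: $!";
    while (<$file1>) {
        chomp;
        $hash{$_}++;        # an ID seen in file 1 contributes 1
    }
    close $file1;

    open my $file2, '<', 'file2.txt' or die "opening file2.txt: $!";
    while (<$file2>) {
        chomp;
        $hash{$_} += 3;     # an ID seen in file 2 contributes 3
    }
    close $file2;

    open my $both,  '>', 'out.both' or die $!;
    open my $only1, '>', 'out.1'    or die $!;
    open my $only2, '>', 'out.2'    or die $!;

    # Dump the hash according to the rules:
    # 1 => file 1 only, 3 => file 2 only, 4 => both files.
    while (my ($id, $count) = each %hash) {
        if    ($count == 1) { print $only1 "$id\n" }
        elsif ($count == 3) { print $only2 "$id\n" }
        elsif ($count == 4) { print $both  "$id\n" }
    }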
Re: Help for finding duplicates in huge files
by CountZero (Bishop) on Jul 16, 2011 at 10:24 UTC
    Actually it goes quite fast.

    I first generated your two 1.5-million-row files to test with. Each ID has a length of 38 characters and is guaranteed to be unique within each file. Some IDs, however, can be found in both files.

    The following program reads both files and produces three arrays: @both (containing the IDs which were found in both files), @first and @second (IDs only found in one of both files).

    use Modern::Perl;

    # Read the first file into a hash: the keys are the IDs.
    my %first_file;
    {
        open my $FIRST, '<', 'first.txt' or die $!;
        while (<$FIRST>) {
            chomp;
            $first_file{$_} = '';
        }
    }

    # Check every ID of the second file against that hash.
    my (@both, @first, @second);
    {
        open my $SECOND, '<', 'second.txt' or die $!;
        while (<$SECOND>) {
            chomp;
            if (exists $first_file{$_}) {
                push @both, $_;             # seen in both files
                delete $first_file{$_};     # so only file-1-only IDs remain
            }
            else {
                push @second, $_;           # only in the second file
            }
        }
    }

    # Whatever is left in the hash was never matched: unique to the first file.
    @first = keys %first_file;
    On my Windows XP Lenovo X200s laptop running Perl 5.12, it takes 5 seconds to read the first file into the hash and 10 seconds to check the second file against it and put the IDs into their respective arrays.
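
    If the three result files are wanted rather than the in-memory arrays, a small (untested) addition like this could follow the code above; the output file names are my own invention:

    # Write the three result arrays out to the files the OP asked for.
    for my $set ([ 'out.both', \@both ], [ 'out.1', \@first ], [ 'out.2', \@second ]) {
        my ($file, $ids) = @$set;
        open my $OUT, '>', $file or die "opening $file: $!";
        print {$OUT} map { "$_\n" } @$ids;
        close $OUT;
    }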

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Help for finding duplicates in huge files
by sundialsvc4 (Abbot) on Jul 16, 2011 at 11:46 UTC

    With memory sizes as they are today, 1.5 million rows is “just not that interesting.” If you had data volumes several times larger than that, you could do it like they did it before computers came along: sort both trays of punched cards into an identically sorted sequence, then run both trays at the same time through the merge machine.

    If you are using an SQL database, simply ORDER BY the same key. When you process n data streams that you know to be identically sorted, you simply select the smallest of the “current records” from each of the n streams until all streams are at end-of-file.

    External sorting algorithms are among the most heavily studied. Dr. Knuth titled one of his tomes Sorting and Searching for a very good reason.
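
    For what it's worth, a rough (untested) sketch of that n-stream merge, assuming the inputs are plain pre-sorted files passed on the command line:

    #!/usr/bin/perl
    # Untested sketch of the n-stream idea: repeatedly take the smallest
    # "current record" among several identically sorted streams.
    use strict;
    use warnings;

    # Assumes each input file is already sorted the same way
    # (e.g. by an external sort or an SQL ORDER BY).
    my @files   = @ARGV;
    my @handles = map { open my $fh, '<', $_ or die "opening $_: $!"; $fh } @files;
    my @current = map { scalar readline $_ } @handles;

    while (grep { defined } @current) {
        # Pick the stream whose current record sorts first.
        my $min;
        for my $i (0 .. $#current) {
            next unless defined $current[$i];
            $min = $i if !defined $min or $current[$i] lt $current[$min];
        }
        print $current[$min];                      # process the smallest record
        $current[$min] = readline $handles[$min];  # advance only that stream
    }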

Re: Help for finding duplicates in huge files
by Anonymous Monk on Jul 16, 2011 at 02:12 UTC
    Hi,

    You may already be doing it the most efficient way.

    Show us some code. Or at least tell us what you have tried.

    J.C.

Re: Help for finding duplicates in huge files
by julio_514 (Acolyte) on Jul 16, 2011 at 20:30 UTC

    Guys, thank you so much for your suggestions! Using hash tables works like a charm. I won't post my original script because I would be embarrassed :|

    I apologize, the post title is misleading - it is true that these are not huge files by today's standards.