sheasbys has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that seems to work OK at finding duplicate records from two files based on specific strings found in each file. Initially I read the second file into memory, then take the strings from the first file and compare them to the file in memory. When there is a match I write it to an output file. As the files are about 16MB and could get bigger, I want to find a more efficient way of handling the second file.

Here is my code:
use File::Copy;

my($input_file1) = $ARGV[0];
my($input_file2) = $ARGV[1];
my($output_file) = $ARGV[2];

if ( !defined($input_file1) || !defined($input_file2) || !defined($output_file) ) {
    print "Error: usage: nodups input_file1 input_file2 output_file\n";
}
else {
    # -----Backup the input files in case of error-----
    copy( $input_file1, $input_file1 . ".bak" )
        or die "Could not backup file 1 $input_file1 to $input_file1.bak: $!\n";
    copy( $input_file2, $input_file2 . ".bak" )
        or die "Could not backup file 2 $input_file2 to $input_file2.bak: $!\n";

    # -----Attempt to open all of the files-----
    open( INFILE1, $input_file1 )       || die( "Could not read input file 1 ($input_file1): $!" );
    open( INFILE2, $input_file2 )       || die( "Could not read input file 2 ($input_file2): $!" );
    open( OUTPUT, "> " . $output_file ) || die( "Could not open output file ($output_file): $!" );

    # -----Read input_file2 into an array so that (later) we can do a binary search-----
    @input2 = <INFILE2>;

    # -----Debug code. Add in if you are experiencing problems. Note that this is used below to print-----
    # -----out the current line number-----
    # $linecount = 0;
    # $outputcount = 0;

    while (<INFILE1>) {
        my $line = $_;
        chomp($line);

        # -----A line starting with a '2' is a header and is left unchanged
        if ( $line !~ m/^2/ ) {
            foreach $line2 (@input2) {
                $date          = substr( $line, 6, 6 );
                $number_dialed = substr( $line, 29, 10 );
                $connect_time  = substr( $line, 54, 12 );
                if (    index( $line2, $date ) != -1
                    and index( $line2, $number_dialed ) != -1
                    and index( $line2, $connect_time ) != -1 )
                {
                    # -----Generate the output string-----
                    $output_line = substr( $line, 0, 6 ) . $date
                                 . substr( $line, 12, 17 ) . $number_dialed
                                 . substr( $line, 39, 15 ) . $connect_time
                                 . substr( $line, 66, 144 ) . "\n";
                    print OUTPUT $output_line;

                    # -----Debug code. Add in if you are experiencing problems-----
                    # print STDOUT "Output " . ++$outputcount . "\n";

                    # -----If we have found the line, we want to exit the loop-----
                    last;
                }
            }

            # -----Debug code. Add in if you are experiencing problems-----
            # print STDOUT "Line " . ++$linecount . "\n";
        }
        else {
            print OUTPUT $line . "\n";
        }
    }

    # -----Close all of the files-----
    close( INFILE1 );
    close( INFILE2 );
    close( OUTPUT );
}

Thank you.

Replies are listed 'Best First'.
Re: File Handling for Duplicate Records
by shmem (Chancellor) on Dec 21, 2006 at 22:39 UTC
    That's highly inefficient. For each line read from INFILE1, you iterate over all of the lines previously read from INFILE2, and you are doing all the substring mumbo jumbo again and again.

    It seems from your if condition that you are only interested in ($date, $number_dialed, $connect_time).

    A better approach seems to be:

    • read all lines from INFILE2 in a loop and e.g. concatenate the fields of interest from each line
    • stuff the result into a hash, using the concatenated string as the key
    • then, while reading INFILE1, construct a key in the same manner and look up that key in the previously built hash (a sketch follows below).
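
    A minimal, untested sketch of that approach, using the field offsets from your script; it assumes the three fields sit at the same offsets in file2 (if not, adjust the substr() calls for file2), and it simply prints the whole matching line rather than reassembling it:

    use strict;
    use warnings;

    my ($input_file1, $input_file2, $output_file) = @ARGV;

    # Build a lookup hash from file2: one key per line, made of the same
    # three fields the original script compares.
    open my $in2, '<', $input_file2 or die "Could not read $input_file2: $!";
    my %seen;
    while (my $line2 = <$in2>) {
        my $key = substr( $line2, 6, 6 )      # date
                . substr( $line2, 29, 10 )    # number dialed
                . substr( $line2, 54, 12 );   # connect time
        $seen{$key} = 1;
    }
    close $in2;

    # Single pass over file1 with an O(1) hash lookup per line, instead of
    # rescanning all of file2 for every line.
    open my $in1, '<', $input_file1 or die "Could not read $input_file1: $!";
    open my $out, '>', $output_file or die "Could not open $output_file: $!";
    while (my $line = <$in1>) {
        chomp $line;
        if ( $line =~ /^2/ ) {                # header lines pass through unchanged
            print $out "$line\n";
            next;
        }
        my $key = substr( $line, 6, 6 )
                . substr( $line, 29, 10 )
                . substr( $line, 54, 12 );
        print $out "$line\n" if exists $seen{$key};
    }
    close $in1;
    close $out;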

    If the storage needed for building the hash in the first place (reading INFILE2) exceeds your workstation's memory resources, store them in e.g. a DB_File or DBD::SQLite database.
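
    Should the hash outgrow memory, the same lookup can be tied to a file on disk. A rough sketch with DB_File (the file name seen.db is only an example; populating and querying the hash then works exactly as above):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie the lookup hash to an on-disk DB file so it no longer has to fit
    # in memory; everything else in the script stays the same.
    tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    # ... fill %seen from file2 and look up keys from file1 as above ...

    untie %seen;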

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: File Handling for Duplicate Records
by Thelonius (Priest) on Dec 22, 2006 at 11:27 UTC
    If you're on a Unix-ish system (having sort and join)--or cygwin on Windows--you can do this with a few lines of shell:
    perl -ne '
      if (!/^2/) {
        $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
        print "$k|$_"
      }' file1 | sort -t "|" -k 1,1 > file1.sorted

    # This code assumes the fields are in the same place in file2
    # as they are in file1, but if not, you'll have to change this.
    perl -ne '
      $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
      print "$k\n"
    ' file2 | sort -t "|" -k 1,1 > file2.sorted

    # I am only outputting the key here since you don't seem
    # to be doing anything with the rest of 'line2'
    join -t '|' file1.sorted file2.sorted | cut -d '|' -f 2 > duplicates
    With the input of file1:
    3 110582 SFCA 4158675309 041414041421
    3 060784 NYNY 2125552368 190159204657
    3 121906 RANC 9195551234 123401123620
    and file2:
    3 110582 SFCA 4158675309 041414041421
    your program and mine both produced the output:
    3 110582 SFCA 4158675309 041414041421

    Notes:

    • Make sure you use a delimiter character (I used "|") that's not in the data. You're not limited to printable characters.
    • Strictly speaking, there could be some difference in the output of the two programs. You truncated line1 at 210 characters; I don't. If line1 matches more than one line in file2, I produce multiple lines of output; you produce only one. Our output is also in a different order.
    • You could save time if you know one of the files is already sorted. For example, maybe file2 doesn't change each run. You can also merge two sorted files using sort -m.
    • If you want the lines that are not duplicates, use join -v

    For example, say you have a new file, newdata, and a file, alreadyprocessed, which corresponds to my file2.sorted above. That is, it's just the keys in sorted order. You could do this:

    perl -ne '
      if (!/^2/) {
        $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
        print "$k|$_"
      }' newdata | sort -t "|" -k 1,1 > newdata.sorted

    join -t '|' -v 1 newdata.sorted alreadyprocessed > needsprocessing
    cut -d '|' -f 2 needsprocessing > processinput

    # Then do the processing
    # ...
    # ...

    # If everything runs okay
    cut -d '|' -f 1 needsprocessing | sort -m - alreadyprocessed > mergeout
    mv alreadyprocessed alreadyprocessed.bak
    mv mergeout alreadyprocessed

      What about comm? Or am I missing something? Of course, if file1 needs transforming first, use perl or whatever filter.

      # comm -12 <(sort file1) <(sort file2) > dups.out

      If the 'cmd <(cmd1 ...) ...' notation is not supported by your shell, it is equivalent to the two-step process 'cmd1 ... > temp1; cmd temp1', cmd1 being a "filter".

      or in other "words" ;)

      % stephan@armen (/home/stephan) %
      % cat dat1
      3 110582 SFCA 4158675309 041414041421
      3 060784 NYNY 2125552368 190159204657
      3 121906 RANC 9195551234 123401123620
      % stephan@armen (/home/stephan) %
      % cat dat2
      3 110582 SFCB 2258675309 041414041421
      3 110582 SFCA 4158675309 041414041421
      % stephan@armen (/home/stephan) %
      % sort dat1 > dat1.sorted
      % stephan@armen (/home/stephan) %
      % sort dat2 > dat2.sorted
      % stephan@armen (/home/stephan) %
      % comm -12 dat1.sorted dat2.sorted
      3 110582 SFCA 4158675309 041414041421
      hth --stephan, just another unix hacker,