sheasbys has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that seems to work OK at finding duplicate records from two files based on specific strings found in each file. Initially I read the second file into memory, then take the strings from the first file and compare them to the file in memory. When there is a match I write it to an output file. As the files are about 16MB and could get bigger, I want to find a more efficient way of handling the second file.

Here is my code:
use File::Copy;

my($input_file1) = $ARGV[0];
my($input_file2) = $ARGV[1];
my($output_file) = $ARGV[2];

if ( !defined($input_file1) || !defined($input_file2) || !defined($output_file) ) {
    print "Error: usage: nodups input_file1 input_file2 output_file\n";
}
else {
    # -----Backup the input files in case of error-----
    copy( $input_file1, $input_file1 . ".bak" )
        or die "Could not backup file 1 $input_file1 to $input_file1.bak: $!\n";
    copy( $input_file2, $input_file2 . ".bak" )
        or die "Could not backup file 2 $input_file2 to $input_file2.bak: $!\n";

    # -----Attempt to open all of the files-----
    open( INFILE1, $input_file1 )       || die( "Could not read input file 1 ($input_file1): $!" );
    open( INFILE2, $input_file2 )       || die( "Could not read input file 2 ($input_file2): $!" );
    open( OUTPUT, "> " . $output_file ) || die( "Could not open output file ($output_file): $!" );

    # -----Read input_file2 into an array so that (later) we can do a binary search-----
    @input2 = <INFILE2>;

    # -----Debug code. Add in if you are experiencing problems. Note that this is used below to print-----
    # -----out the current line number-----
    # $linecount = 0;
    # $outputcount = 0;

    while (<INFILE1>) {
        my $line = $_;
        chomp($line);

        # -----A line starting with a '2' is a header and is left unchanged
        if ( $line !~ m/^2/ ) {
            foreach $line2 (@input2) {
                $date          = substr( $line, 6, 6 );
                $number_dialed = substr( $line, 29, 10 );
                $connect_time  = substr( $line, 54, 12 );
                if (    index( $line2, $date ) != -1
                    and index( $line2, $number_dialed ) != -1
                    and index( $line2, $connect_time ) != -1 )
                {
                    # -----Generate the output string-----
                    $output_line = substr( $line, 0, 6 ) . $date
                                 . substr( $line, 12, 17 ) . $number_dialed
                                 . substr( $line, 39, 15 ) . $connect_time
                                 . substr( $line, 66, 144 ) . "\n";
                    print OUTPUT $output_line;

                    # -----Debug code. Add in if you are experiencing problems-----
                    # print STDOUT "Output " . ++$outputcount . "\n";

                    # -----If we have found the line, we want to exit the loop-----
                    last;
                }
            }

            # -----Debug code. Add in if you are experiencing problems-----
            # print STDOUT "Line " . ++$linecount . "\n";
        }
        else {
            print OUTPUT $line . "\n";
        }
    }

    # -----Close all of the files-----
    close( INFILE1 );
    close( INFILE2 );
    close( OUTPUT );
}

Thank you.

Replies are listed 'Best First'.
Re: File Handling for Duplicate Records
by shmem (Chancellor) on Dec 21, 2006 at 22:39 UTC
    That's highly inefficient. For each line read from INFILE1, you iterate over all of the lines previously read from INFILE2, and you are doing all the substring mumbo jumbo again and again.

    It seems from your if condition that you are only interested in ($date, $number_dialed, $connect_time).

    A better approach seems to be:

    • read all lines from INFILE2 in a loop and e.g. concatenate the fields of interest from each line
    • stuff the result into a hash, using the concatenated string as the key
    • then, while reading INFILE1, construct a key in the same manner and look up that key in the previously built hash (a sketch follows below).
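
    A minimal, untested sketch of that approach, using the field offsets from your script; it assumes the three fields sit at the same offsets in file2 (if not, adjust the substr() calls for file2), and it simply prints the whole matching line rather than reassembling it:

    use strict;
    use warnings;

    my ($input_file1, $input_file2, $output_file) = @ARGV;

    # Build a lookup hash from file2: one key per line, made of the same
    # three fields the original script compares.
    open my $in2, '<', $input_file2 or die "Could not read $input_file2: $!";
    my %seen;
    while (my $line2 = <$in2>) {
        my $key = substr( $line2, 6, 6 )      # date
                . substr( $line2, 29, 10 )    # number dialed
                . substr( $line2, 54, 12 );   # connect time
        $seen{$key} = 1;
    }
    close $in2;

    # Single pass over file1 with an O(1) hash lookup per line, instead of
    # rescanning all of file2 for every line.
    open my $in1, '<', $input_file1 or die "Could not read $input_file1: $!";
    open my $out, '>', $output_file or die "Could not open $output_file: $!";
    while (my $line = <$in1>) {
        chomp $line;
        if ( $line =~ /^2/ ) {                # header lines pass through unchanged
            print $out "$line\n";
            next;
        }
        my $key = substr( $line, 6, 6 )
                . substr( $line, 29, 10 )
                . substr( $line, 54, 12 );
        print $out "$line\n" if exists $seen{$key};
    }
    close $in1;
    close $out;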

    If the storage needed for building the hash in the first place (reading INFILE2) exceeds your workstation's memory resources, store them in e.g. a DB_File or DBD::SQLite database.
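
    Should the hash outgrow memory, the same lookup can be tied to a file on disk. A rough sketch with DB_File (the file name seen.db is only an example; populating and querying the hash then works exactly as above):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie the lookup hash to an on-disk DB file so it no longer has to fit
    # in memory; everything else in the script stays the same.
    tie my %seen, 'DB_File', 'seen.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie seen.db: $!";

    # ... fill %seen from file2 and look up keys from file1 as above ...

    untie %seen;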

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: File Handling for Duplicate Records
by Thelonius (Priest) on Dec 22, 2006 at 11:27 UTC
    If you're on a Unix-ish system (having sort and join)--or cygwin on Windows--you can do this with a few lines of shell:
    perl -ne '
      if (!/^2/) {
        $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
        print "$k|$_"
      }' file1 | sort -t "|" -k 1,1 > file1.sorted

    # This code assumes the fields are in the same place in file2
    # as they are in file1, but if not, you'll have to change this.
    perl -ne '
      $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
      print "$k\n"
    ' file2 | sort -t "|" -k 1,1 > file2.sorted

    # I am only outputting the key here since you don't seem
    # to be doing anything with the rest of 'line2'
    join -t '|' file1.sorted file2.sorted | cut -d '|' -f 2 > duplicates
    With the input of file1:
    3 110582 SFCA 4158675309 041414041421
    3 060784 NYNY 2125552368 190159204657
    3 121906 RANC 9195551234 123401123620
    and file2:
    3 110582 SFCA 4158675309 041414041421
    your program and mine both produced the output:
    3 110582 SFCA 4158675309 041414041421

    Notes:

    • Make sure you use a delimiter character (I used "|") that's not in the data. You're not limited to printable characters.
    • Strictly speaking, there could be some difference in the output of the two programs. You truncated line1 at 210 characters; I don't. If line1 matches more than one line in file2, I produce multiple lines of output; you produce only one. Our output is also in a different order.
    • You could save time if you know one of the files is already sorted. For example, maybe file2 doesn't change each run. You can also merge two sorted files using sort -m.
    • If you want the lines that are not duplicates, use join -v

    For example, say you have a new file, newdata, and a file, alreadyprocessed, which corresponds to my file2.sorted above. That is, it's just the keys in sorted order. You could do this:

    perl -ne '
      if (!/^2/) {
        $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
        print "$k|$_"
      }' newdata | sort -t "|" -k 1,1 > newdata.sorted

    join -t '|' -v 1 newdata.sorted alreadyprocessed > needsprocessing
    cut -d '|' -f 2 needsprocessing > processinput

    # Then do the processing
    # ...
    # ...

    # If everything runs okay
    cut -d '|' -f 1 needsprocessing | sort -m - alreadyprocessed > mergeout
    mv alreadyprocessed alreadyprocessed.bak
    mv mergeout alreadyprocessed

      What about comm? Or am I missing something? Of course, if file1 needs transforming first, use perl or whatever filter.

      # comm -12 <(sort file1) <(sort file2) > dups.out

      If the 'cmd <(cmd1 ...) ...' notation is not supported by your shell, it is equivalent to the two-step process 'cmd1 ... > temp1; cmd temp1', cmd1 being a "filter".

      or in other "words" ;)

      % stephan@armen (/home/stephan) %
      % cat dat1
      3 110582 SFCA 4158675309 041414041421
      3 060784 NYNY 2125552368 190159204657
      3 121906 RANC 9195551234 123401123620
      % stephan@armen (/home/stephan) %
      % cat dat2
      3 110582 SFCB 2258675309 041414041421
      3 110582 SFCA 4158675309 041414041421
      % stephan@armen (/home/stephan) %
      % sort dat1 > dat1.sorted
      % stephan@armen (/home/stephan) %
      % sort dat2 > dat2.sorted
      % stephan@armen (/home/stephan) %
      % comm -12 dat1.sorted dat2.sorted
      3 110582 SFCA 4158675309 041414041421
      hth --stephan, just another unix hacker,