I have a script that seems to work OK at finding duplicate records in two files, based on specific strings found in each file. It first reads the second file into memory, then takes the strings from each line of the first file and compares them against every line held in memory. When there is a match, it writes the record to an output file. The files are about 16MB each and are likely to grow, so I want to find a more efficient way of handling the second file.

Here is my code:
use File::Copy;

my($input_file1) = $ARGV[0];
my($input_file2) = $ARGV[1];
my($output_file) = $ARGV[2];

if ( !defined($input_file1) || !defined($input_file2) || !defined($output_file) ) {
    print "Error: usage: nodups input_file1 input_file2 output_file\n";
}
else {
    # -----Backup the input files in case of error-----
    copy( $input_file1, $input_file1 . ".bak" )
        or die "Could not backup file 1 $input_file1 to $input_file1.bak: $!\n";
    copy( $input_file2, $input_file2 . ".bak" )
        or die "Could not backup file 2 $input_file2 to $input_file2.bak: $!\n";

    # -----Attempt to open all of the files-----
    open( INFILE1, $input_file1 ) || die( "Could not read input file 1 ($input_file1): $!" );
    open( INFILE2, $input_file2 ) || die( "Could not read input file 2 ($input_file2): $!" );
    open( OUTPUT, "> " . $output_file ) || die( "Could not open output file ($output_file): $!" );

    # -----Read input_file2 into an array so that (later) we can do a binary search-----
    @input2 = <INFILE2>;

    # -----Debug code. Add in if you are experiencing problems. Note that his is used below to print-----
    # -----out the current line number-----
    # $linecount = 0;
    # $outputcount = 0;

    while (<INFILE1>) {
        my $line = $_;
        chomp($line);

        # -----A line starting with a '2' is a header and is left unchanged
        if ( $line !~ m/^2/ ) {
            foreach $line2 (@input2) {
                $date          = substr( $line, 6,  6 );
                $number_dialed = substr( $line, 29, 10 );
                $connect_time  = substr( $line, 54, 12 );

                if (    index( $line2, $date ) != -1
                    and index( $line2, $number_dialed ) != -1
                    and index( $line2, $connect_time ) != -1 )
                {
                    # -----Generate the output string-----
                    $output_line =
                        substr( $line, 0, 6 )
                      . $date
                      . substr( $line, 12, 17 )
                      . $number_dialed
                      . substr( $line, 39, 15 )
                      . $connect_time
                      . substr( $line, 66, 144 ) . "\n";
                    print OUTPUT $output_line;

                    # -----Debug code. Add in if you are experiencing problems-----
                    # print STDOUT "Output " . ++$outputcount . "\n";

                    # -----If we have found the line, we want to exit the loop-----
                    last;
                }
            }

            # -----Debug code. Add in if you are experiencing problems-----
            # print STDOUT "Line " . ++$linecount . "\n";
        }
        else {
            print OUTPUT $line . "\n";
        }
    }

    # -----Close all of the files-----
    close( INFILE1 );
    close( INFILE2 );
    close( OUTPUT );
}

Thank you.
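
For reference, one direction that might avoid rescanning file 2 for every line of file 1 is to index file 2 in a hash first, so each lookup is a single hash probe. The sketch below is only an illustration under assumptions, not the script above: it assumes file 2 uses the same fixed-width layout as file 1 (the script above only checks that the three strings occur somewhere in the line, which is a looser test), and the names record_key, build_index and find_duplicates are made up for the example.

```perl
use strict;
use warnings;

# Build a lookup key from the three fields the script above compares:
# date (offset 6, length 6), number dialed (offset 29, length 10),
# connect time (offset 54, length 12).
sub record_key {
    my ($line) = @_;
    return join '|',
        substr( $line, 6,  6 ),
        substr( $line, 29, 10 ),
        substr( $line, 54, 12 );
}

# Read file 2 once and remember every key seen.
# ASSUMPTION: file 2 has the same fixed-width layout as file 1.
sub build_index {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if length($line) < 66;    # too short to hold all three fields
        $seen{ record_key($line) } = 1;
    }
    return \%seen;
}

# Return header lines (starting with '2') unchanged, plus the first
# 210 characters of every file 1 record whose key is in the index.
sub find_duplicates {
    my ( $fh, $seen ) = @_;
    my @out;
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^2/ ) { push @out, $line; next; }
        next if length($line) < 66;
        push @out, substr( $line, 0, 210 ) if $seen->{ record_key($line) };
    }
    return @out;
}

# Tiny demonstration with in-memory "files" (made-up sample records).
{
    my $match = ( '0' x 6 ) . '240101' . ( ' ' x 17 ) . '5551234567'
              . ( ' ' x 15 ) . '120000000000';
    my $nomatch = ( '0' x 6 ) . '240102' . ( ' ' x 17 ) . '5551234567'
                . ( ' ' x 15 ) . '120000000000';
    my $file1 = "2HEADER\n$match\n$nomatch\n";
    my $file2 = "$match\n";

    open my $in2, '<', \$file2 or die "Cannot open in-memory file 2: $!";
    my $seen = build_index($in2);

    open my $in1, '<', \$file1 or die "Cannot open in-memory file 1: $!";
    print "$_\n" for find_duplicates( $in1, $seen );
}
```

With this shape, memory holds one short key per file 2 record rather than every full line, and file 1 is still streamed line by line. If file 2 does not share the layout, the key extraction for file 2 would need its own offsets.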

In reply to File Handling for Duplicate Records by sheasbys
