I have a script that seems to work at finding duplicate records across two files, based on specific strings found in each file. It first reads the second file into memory, then takes three strings from each line of the first file and compares them against every line held in memory. When there is a match, it writes the record to an output file. The files are about 16MB each and may grow, so I want to find a more efficient way of handling the second file.

Here is my code:

use strict;
use warnings;
use File::Copy;

my ( $input_file1, $input_file2, $output_file ) = @ARGV;

if ( !defined($input_file1) || !defined($input_file2) || !defined($output_file) ) {
    print "Error: usage: nodups input_file1 input_file2 output_file\n";
}
else {

    # -----Backup the input files in case of error-----
    copy( $input_file1, $input_file1 . ".bak" ) or
        die "Could not backup file 1 $input_file1 to $input_file1.bak: $!\n";
    copy( $input_file2, $input_file2 . ".bak" ) or
        die "Could not backup file 2 $input_file2 to $input_file2.bak: $!\n";

    # -----Attempt to open all of the files-----
    open( my $infile1, '<', $input_file1 ) || die( "Could not read input file 1 ($input_file1): $!" );
    open( my $infile2, '<', $input_file2 ) || die( "Could not read input file 2 ($input_file2): $!" );
    open( my $output, '>', $output_file ) || die( "Could not open output file ($output_file): $!" );

    # -----Read input_file2 into an array so that (later) we can do a binary search-----
    my @input2 = <$infile2>;

    # -----Debug code. Add in if you are experiencing problems. Note that this is used below to print-----
    # -----out the current line number-----
    # my $linecount = 0;
    # my $outputcount = 0;

    while ( my $line = <$infile1> ) {
        chomp($line);

        # -----A line starting with a '2' is a header and is left unchanged-----
        if ( $line !~ m/^2/ ) {

            # -----Extract the key fields once per record rather than once per comparison-----
            my $date          = substr( $line, 6,  6 );
            my $number_dialed = substr( $line, 29, 10 );
            my $connect_time  = substr( $line, 54, 12 );

            foreach my $line2 (@input2) {

                # -----A match means all three fields appear somewhere in the line from file 2-----
                if ( index( $line2, $date ) != -1 and index( $line2, $number_dialed ) != -1 and index( $line2, $connect_time ) != -1 ) {

                    # -----Generate the output string-----
                    my $output_line = substr( $line, 0, 6 )
                        . $date . substr( $line, 12, 17 )
                        . $number_dialed . substr( $line, 39, 15 )
                        . $connect_time . substr( $line, 66, 144 ) . "\n";

                    print {$output} $output_line;
                    # -----Debug code. Add in if you are experiencing problems-----
                    # print STDOUT "Output " . ++$outputcount . "\n";

                    # -----If we have found the line, we want to exit the loop-----
                    last;
                }
            }
            # -----Debug code. Add in if you are experiencing problems-----
            # print STDOUT "Line " . ++$linecount . "\n";
        }
        else {
            print {$output} $line . "\n";
        }
    }

    # -----Close all of the files-----
    close($infile1);
    close($infile2);
    close($output);
}
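
One possible direction, as a rough sketch only: instead of scanning every line of the second file for each record of the first, build a hash of composite keys from the second file once and test each record of the first file with a single lookup. This relies on a big assumption that the code above does not make: that the second file stores the date, number dialed and connect time at the same fixed columns as the first file. The code above matches with index() anywhere in the line, so the two approaches are only equivalent if that assumption holds. The names record_key, load_keys and write_matches are made up for illustration.

```perl
use strict;
use warnings;

# Build a composite key from the fixed-width fields used in the script above:
# date (offset 6, length 6), number dialed (offset 29, length 10),
# connect time (offset 54, length 12).
sub record_key {
    my ($line) = @_;
    return if length($line) < 66;    # too short to hold all three fields
    return join '|', substr( $line, 6, 6 ), substr( $line, 29, 10 ), substr( $line, 54, 12 );
}

# Read file 2 once, keeping only its keys rather than its full lines.
# ASSUMPTION: file 2 uses the same column layout as file 1.
sub load_keys {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        chomp $line;
        my $key = record_key($line);
        $seen{$key} = 1 if defined $key;
    }
    return \%seen;
}

# Stream file 1: header lines (starting with '2') pass through unchanged;
# other lines are written only if their key was seen in file 2.
sub write_matches {
    my ( $in, $out, $seen ) = @_;
    while ( my $line = <$in> ) {
        chomp $line;
        if ( $line =~ /^2/ ) {
            print {$out} "$line\n";
            next;
        }
        my $key = record_key($line);

        # The script above rebuilds the first 210 characters of the record.
        print {$out} substr( $line, 0, 210 ) . "\n"
            if defined $key && $seen->{$key};
    }
    return;
}

# Driver: runs only when called with the same three arguments as above.
if ( @ARGV == 3 ) {
    my ( $file1, $file2, $outfile ) = @ARGV;
    open( my $in2, '<', $file2 ) or die "Could not read $file2: $!";
    my $seen = load_keys($in2);
    close($in2);

    open( my $in1, '<', $file1 )   or die "Could not read $file1: $!";
    open( my $out, '>', $outfile ) or die "Could not open $outfile: $!";
    write_matches( $in1, $out, $seen );
    close($in1);
    close($out) or die "Could not close $outfile: $!";
}
```

With this shape the work is roughly one pass over each file, and memory holds one short key per record of the second file instead of every full line. If the second file does not share the layout, the key would have to be built from wherever those fields actually live in it.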

Thank you.

In reply to File Handling for Duplicate Records by sheasbys