Salwrwr has asked for the wisdom of the Perl Monks concerning the following question:

I have two huge text files whose first column is common to both; the second file contains more records than the first. I want to read the first column of the second file element by element and check it against the first column of the first file, also element by element, to find a match. When a match is found, I want to print only the matched rows from both files to a new file. I wrote the following Perl code, which works but is very slow and memory-expensive. Could anybody help me find a better way or fix my code, please:
use strict;
use warnings;
use Getopt::Std;

open (my $newfile, ">", "C:/result.txt");
open (my $fh, "<", "C:/position.txt");
open (my $file, "<", "C:/platform.txt");
my @file_data = <$fh>;
my @position  = <$file>;
close($fh);
close($file);

foreach my $line (@position) {
    my @line  = split(/\t/, $line);
    my $start = $line[0];
    foreach my $values (@file_data) {
        my @values = split(/\t/, $values);
        my $id = $values[0];
        if ($start eq $id) {
            print $newfile $line[0], "\t", $line[2], "\t", $line[3], "\t",
                           $values[0], "\t", $values[1], "\t", $values[2], "\n";
        }
    }
}
print "DONE";

Replies are listed 'Best First'.
Re: match text files
by toolic (Bishop) on Jul 18, 2013 at 17:56 UTC
    • Add code tags: Writeup Formatting Tips
    • I see no need to read position.txt into an array. You can use a while loop instead of the foreach loop. This should ease the memory issue.
    • Perhaps use a hash instead of an array for file_data. This may speed things up. (UPDATE: On 2nd thought, probably not.)

    Can you show us 10 representative lines from each file?
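    The streaming suggestion above can be sketched like this. It is a minimal illustration only: the in-memory sample data stands in for the OP's two files (real code would open position.txt and platform.txt), and the tab-separated layout is assumed from the OP's split calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup from the smaller file's first column to the rest of
# the line. Sample data here is made up for illustration.
my %platform;
my $platform_txt = "id1\tp1\nid2\tp2\n";
open my $fh, '<', \$platform_txt or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $id, $rest ) = split /\t/, $line, 2;
    $platform{$id} = $rest;
}
close $fh;

# Stream the larger file one line at a time (while, not foreach),
# so only the current line is held in memory.
my @matches;
my $position_txt = "id2\ta\nid3\tb\n";
open my $pos, '<', \$position_txt or die $!;
while ( my $line = <$pos> ) {
    chomp $line;
    my ( $id, $rest ) = split /\t/, $line, 2;
    push @matches, "$id\t$rest\t$platform{$id}" if exists $platform{$id};
}
close $pos;

print "$_\n" for @matches;
```

    With real files, replace the scalar-reference opens with the usual three-argument opens on the file names.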

Re: match text files
by mtmcc (Hermit) on Jul 18, 2013 at 18:03 UTC
    Something like this might work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $fileOne = $ARGV[0];
    my $fileTwo = $ARGV[1];
    my $x = 0;
    my $count = 0;

    open (my $one, "<", $fileOne);
    while (<$one>) {
        $count += 1;
    }

    open ($one, "<", $fileOne);
    open (my $two, "<", $fileTwo);
    open (my $out, ">", "outputfile.txt");

    for ($x = 0; $x < $count; $x += 1) {
        my $lineOne = <$one>;
        my $lineTwo = <$two>;
        print $out "$lineOne" if $lineOne eq $lineTwo;
    }

Re: match text files
by Anonymous Monk on Jul 18, 2013 at 18:26 UTC
Re: match text files
by hdb (Monsignor) on Jul 18, 2013 at 18:17 UTC

    Here is my outline:

    1. Read one file and split into first element and rest. Create hash with key being the first element.
    2. Read second file line by line. Split into first element and rest.
    3. See whether first element exists in hash. If so print all.
    Not tested as I do not have two fitting files at hand.

    use strict;
    use warnings;

    open (my $fh, "<", "C:/position.txt");
    my %position = map { split /\t/, $_, 2 } <$fh>; # not sure about this
    close($fh);

    open (my $file, "<", "C:/platform.txt");
    open (my $newfile, ">", "C:/result.txt");
    foreach my $line (<$file>) {
        my( $start, $rest ) = split /\t/, $line, 2;
        print $newfile $start, "\t", $position{$start}, "\t", $line
            if exists $position{$start};
    }
    close($file);
    print "DONE";
Re: match text files
by rjt (Curate) on Jul 18, 2013 at 20:08 UTC

    Welcome to the monastery, Salwrwr! I'm pretty sure nearly everyone has missed the <code> tags at least once.

    A few general tips, and hopefully not too much overlap with the other replies you've already received:

    • You're not doing any error checking on file operations. Either employ the usual idiom: open my $fh, '<', $filename or die "Can't open $filename: $!";, or use autodie; to obviate the need to code all the checks yourself.
    • You can clean up the print syntax by using join with array slices: say $newfile join "\t", @line[0,2,3], @values[0..2]; (use 5.010 or later, or use feature 'say', to enable say).
    • There's no need to repeatedly split and join @file_data (or even to create @file_data in the first place); do it outside the @position loop and store the result:
    my @values = map { join "\t", (split /\t/)[0..2] } <$fh>;

    # Later, in your say statement:
    say $newfile join "\t", @line[0,2,3], $_ for grep { /^$start\t/ } @values;
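    Putting the tips above together, here is a hedged sketch of the whole job (autodie, say, array slices, and a hash lookup instead of the nested loop). The in-memory sample data and the %platform hash name are stand-ins for illustration; real code would open the OP's C:/platform.txt, C:/position.txt, and C:/result.txt, and the column layout is assumed from the OP's print statement.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;            # failed opens die with a useful message
use feature 'say';      # needs perl 5.10 or later

# Made-up sample data standing in for the OP's two files.
my $platform_txt = "id1\tp1\tp2\nid2\tq1\tq2\n";
my $position_txt = "id2\ta\tb\tc\nid3\td\te\tf\n";

# One pass over the smaller file: key on the first column,
# keep the first three columns for output.
open my $fh, '<', \$platform_txt;
my %platform;
while (<$fh>) {
    chomp;
    my @values = split /\t/;
    $platform{ $values[0] } = [ @values[0 .. 2] ];
}
close $fh;

# One pass over the larger file: O(1) hash lookup per line
# instead of re-scanning the other file every time.
open my $pos, '<', \$position_txt;
open my $newfile, '>', \my $result;
while (<$pos>) {
    chomp;
    my @line = split /\t/;
    say $newfile join "\t", @line[ 0, 2, 3 ], @{ $platform{ $line[0] } }
        if exists $platform{ $line[0] };
}
close $pos;
close $newfile;
```

    The overall shape is the same as the hash-based outline earlier in the thread; the hash turns the quadratic nested loop into two linear passes.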