Salwrwr has asked for the wisdom of the Perl Monks concerning the following question:

I have two huge text files whose first column is common to both; the second file contains more records than the first. I want to read the first column of the second file element by element and check it against the first column of the first file, also element by element, to find a match. When a match is found, I want to print only the matched rows from both files to a new file. I wrote the following Perl code, which works but is very slow and memory-expensive. Could anybody help me find a better way or fix my code, please:
use strict;
use warnings;
use Getopt::Std;

open (my $newfile, ">", "C:/result.txt");
open (my $fh, "<", "C:/position.txt");
open (my $file, "<", "C:/platform.txt");
my @file_data = <$fh>;
my @position  = <$file>;
close($fh);
close($file);

foreach my $line (@position) {
    my @line  = split(/\t/, $line);
    my $start = $line[0];
    foreach my $values (@file_data) {
        my @values = split(/\t/, $values);
        my $id = $values[0];
        if ($start eq $id) {
            print $newfile $line[0], "\t", $line[2], "\t", $line[3], "\t",
                           $values[0], "\t", $values[1], "\t", $values[2], "\n";
        }
    }
}
print "DONE";

Replies are listed 'Best First'.
Re: match text files
by toolic (Bishop) on Jul 18, 2013 at 17:56 UTC
    • Add code tags: Writeup Formatting Tips
    • I see no need to read position.txt into an array. You can use a while loop instead of the foreach loop. This should ease the memory issue.
    • Perhaps use a hash instead of an array for file_data. This may speed things up. (UPDATE: On 2nd thought, probably not.)

    Can you show us 10 representative lines from each file?
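    The streaming suggestion above can be sketched like this. It is a minimal illustration only: the in-memory sample data stands in for the OP's two files (real code would open position.txt and platform.txt), and the tab-separated layout is assumed from the OP's split calls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup from the smaller file's first column to the rest of
# the line. Sample data here is made up for illustration.
my %platform;
my $platform_txt = "id1\tp1\nid2\tp2\n";
open my $fh, '<', \$platform_txt or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $id, $rest ) = split /\t/, $line, 2;
    $platform{$id} = $rest;
}
close $fh;

# Stream the larger file one line at a time (while, not foreach),
# so only the current line is held in memory.
my @matches;
my $position_txt = "id2\ta\nid3\tb\n";
open my $pos, '<', \$position_txt or die $!;
while ( my $line = <$pos> ) {
    chomp $line;
    my ( $id, $rest ) = split /\t/, $line, 2;
    push @matches, "$id\t$rest\t$platform{$id}" if exists $platform{$id};
}
close $pos;

print "$_\n" for @matches;
```

    With real files, replace the scalar-reference opens with the usual three-argument opens on the file names.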

Re: match text files
by mtmcc (Hermit) on Jul 18, 2013 at 18:03 UTC
    Something like this might work:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $fileOne = $ARGV[0];
    my $fileTwo = $ARGV[1];
    my $x = 0;
    my $count = 0;

    open (my $one, "<", $fileOne);
    while (<$one>) {
        $count += 1;
    }

    open ($one, "<", $fileOne);
    open (my $two, "<", $fileTwo);
    open (my $out, ">", "outputfile.txt");

    for ($x = 0; $x < $count; $x += 1) {
        my $lineOne = <$one>;
        my $lineTwo = <$two>;
        print $out "$lineOne" if $lineOne eq $lineTwo;
    }

Re: match text files
by Anonymous Monk on Jul 18, 2013 at 18:26 UTC
Re: match text files
by hdb (Monsignor) on Jul 18, 2013 at 18:17 UTC

    Here is my outline:

    1. Read one file and split into first element and rest. Create hash with key being the first element.
    2. Read second file line by line. Split into first element and rest.
    3. See whether first element exists in hash. If so print all.
    Not tested as I do not have two fitting files at hand.

    use strict;
    use warnings;

    open (my $fh, "<", "C:/position.txt");
    my %position = map { split /\t/, $_, 2 } <$fh>; # not sure about this
    close($fh);

    open (my $file, "<", "C:/platform.txt");
    open (my $newfile, ">", "C:/result.txt");
    foreach my $line (<$file>) {
        my( $start, $rest ) = split /\t/, $line, 2;
        print $newfile $start, "\t", $position{$start}, "\t", $line
            if exists $position{$start};
    }
    close($file);
    print "DONE";
Re: match text files
by rjt (Curate) on Jul 18, 2013 at 20:08 UTC

    Welcome to the monastery, Salwrwr! I'm pretty sure nearly everyone has missed the <code> tags at least once.

    A few general tips, and hopefully not too much overlap with the other replies you've already received:

    • You're not doing any error checking on file operations. Either employ the usual idiom: open my $fh, '<', $filename or die "Can't open $filename: $!";, or use autodie; to obviate the need to code all the checks yourself.
    • You can clean up the print syntax by using join with array slices: say $newfile join "\t", @line[0,2,3], @values[0..2]; (use 5.010 or later, or use feature 'say', to enable say).
    • There's no need to repeatedly split and join @file_data (or even to create @file_data in the first place); do it outside the @position loop and store the result:
    my @values = map { join "\t", (split /\t/)[0..2] } <$fh>;

    # Later, in your say statement:
    say $newfile join "\t", @line[0,2,3], $_ for grep { /^$start\t/ } @values;
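    Putting the tips above together, here is a hedged sketch of the whole job (autodie, say, array slices, and a hash lookup instead of the nested loop). The in-memory sample data and the %platform hash name are stand-ins for illustration; real code would open the OP's C:/platform.txt, C:/position.txt, and C:/result.txt, and the column layout is assumed from the OP's print statement.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;            # failed opens die with a useful message
use feature 'say';      # needs perl 5.10 or later

# Made-up sample data standing in for the OP's two files.
my $platform_txt = "id1\tp1\tp2\nid2\tq1\tq2\n";
my $position_txt = "id2\ta\tb\tc\nid3\td\te\tf\n";

# One pass over the smaller file: key on the first column,
# keep the first three columns for output.
open my $fh, '<', \$platform_txt;
my %platform;
while (<$fh>) {
    chomp;
    my @values = split /\t/;
    $platform{ $values[0] } = [ @values[0 .. 2] ];
}
close $fh;

# One pass over the larger file: O(1) hash lookup per line
# instead of re-scanning the other file every time.
open my $pos, '<', \$position_txt;
open my $newfile, '>', \my $result;
while (<$pos>) {
    chomp;
    my @line = split /\t/;
    say $newfile join "\t", @line[ 0, 2, 3 ], @{ $platform{ $line[0] } }
        if exists $platform{ $line[0] };
}
close $pos;
close $newfile;
```

    The overall shape is the same as the hash-based outline earlier in the thread; the hash turns the quadratic nested loop into two linear passes.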