ramouz87 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm new to Perl and would be interested in suggestions from experienced people on how to write optimized code for the following task:
I work with genomic data files larger than 1 GB, and I have 2 files with lines that look like this:
** A line from file1:
HWI-EA332_91026_1_1_7_586#TCTTAT/1 + Chr3 67121130 TATTNTAAGTCTATGTTGGGGGGGTGGTCATTGAATGTAAGNTGGGTCTC
** A matching line from file2:
HWI-EA332_91026_1_1_7_586#ACATAA/2 - Chr7 127074854 AAAATAAAGCTNATCTGGAAGCAACAGTANGAAGCAGAAGACTGNACACC
The id is the substring EA332_91026_1_1_7_586, identified by my @token = split('\-|\#', $line);
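For illustration, that split applied to the first line from file1 gives:
# $token[0] = 'HWI'
# $token[1] = 'EA332_91026_1_1_7_586'    <-- the id
# $token[2] = 'TCTTAT/1 ...'             (the rest of the line)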
What I want to do is:
for each id in file 1, check whether the same id exists in file 2, and when there's a match:
* append the matching line from file 1 followed by the matching line from file 2 to a new file (matched_pair)
* based on a tab-delimited split of each line, write out the id, columns 2/3 from the matching lines in file1 and file2, and the absolute value of the difference between column 4 in file1 and file2 (gap_dist).
Based on the example above we expect to get this line (gap_dist = |127074854 - 67121130| = 59953724):
EA332_91026_1_1_7_586 + Chr3 - Chr7 59953724
Thanks in advance for your kind help.
Regards,
Ramzi

Re: generating merged data from matching id in 2 files
by wfsp (Abbot) on Nov 19, 2009 at 11:13 UTC
    If the smaller of your two files will fit into memory, load it into a hash with the id as the key. Then use a while loop over the other file, checking each id against the hash.
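
    Something like this skeleton, for instance (untested, and the file names are just placeholders; the output part is left for you to fill in):

    use strict;
    use warnings;

    my %seen;

    # load the smaller file into a hash keyed on the id
    open my $fh1, '<', 'file1.txt' or die "Couldn't open file1.txt: $!";
    while ( my $line = <$fh1> ) {
        chomp $line;
        my @token = split /-|#/, $line;    # the id is $token[1]
        $seen{ $token[1] } = $line;
    }
    close $fh1;

    # loop over the other file, checking each id against the hash
    open my $fh2, '<', 'file2.txt' or die "Couldn't open file2.txt: $!";
    while ( my $line = <$fh2> ) {
        chomp $line;
        my @token = split /-|#/, $line;
        if ( exists $seen{ $token[1] } ) {
            # matched: write out whatever you need from both lines here
        }
    }
    close $fh2;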

    If neither will fit in memory then perhaps consider loading one of the files into a DB. DBM::Deep, for example, is particularly suited to _big_ lookups. Then proceed as above.
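
    For example (untested, and the database file name is just made up here), loading the ids from file 1 could look like this:

    use strict;
    use warnings;
    use DBM::Deep;

    # an on-disk hash, so the lookup table doesn't have to fit into memory
    my $db = DBM::Deep->new('file1_ids.db');

    open my $fh, '<', 'file1.txt' or die "Couldn't open file1.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my @token = split /-|#/, $line;
        $db->{ $token[1] } = $line;    # key: id, value: the whole line
    }
    close $fh;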

    Write some code, see how you get on, and if you get stuck, show what you have and we'll help you with it. That way we'll have a better idea of the sort of help you need.

    Good luck!

Re: generating merged data from matching id in 2 files
by stefbv (Priest) on Nov 19, 2009 at 13:57 UTC

    Here is a quick hack, just something to start with, based on the algorithm posted by wfsp: each file is processed only once.

    Note that I don't have experience with large text files and the code is not optimized (but I hope to learn something from this too).

    use strict;
    use warnings;
    use Tie::File;
    use Fcntl qw(O_RDONLY);

    my $file1 = 'file1.txt';
    my $file2 = 'file2.txt';
    my $file3 = 'file3.txt';

    my %result;

    #-- Process file 1
    tie my @file1_arr, 'Tie::File', $file1, mode => O_RDONLY;

    foreach my $record (@file1_arr) {
        my ($id) = $record =~ m/HWI\-(.*)\#/;    # extract the id
        my @rez  = split /\t/, $record;          # id, strand, chr, pos, seq

        # Save strand, chromosome and position for later
        $result{$id} = [ @rez[ 1, 2, 3 ] ];
    }
    untie @file1_arr;    # finished with file 1

    #-- Process file 2 and write output to file 3
    tie my @file2_arr, 'Tie::File', $file2, mode => O_RDONLY;

    #-- The result file
    tie my @content, 'Tie::File', $file3;

    foreach my $record (@file2_arr) {
        my ($id) = $record =~ m/HWI\-(.*)\#/;
        next unless exists $result{$id};         # no match in file 1

        my ( $strand1, $chr1, $pos1 ) = @{ $result{$id} };
        my @rez = split /\t/, $record;

        # Output: id, strand/chr from both files, absolute position difference
        my $record_new = "$id $strand1 $chr1 $rez[1] $rez[2] " . abs( $pos1 - $rez[3] );
        push @content, $record_new;
    }
    untie @file2_arr;    # finished with file 2
    untie @content;      # all finished

      I think that using Tie::File in this particular case (when sequentially iterating over the records) doesn't provide any advantage over simply reading line by line from a normal file handle (rather, it's likely to be slower, due to the much more complex things going on under the hood...).

      That is,

      tie my @file1_arr, 'Tie::File', $file1, mode => O_RDONLY;
      foreach my $record ( @file1_arr ) {
          # ...
      }
      untie @file1_arr;    # finished with file 1

      would become

      open my $fh, "<", $file1 or die "Couldn't open '$file1': $!";
      while ( my $record = <$fh> ) {
          # ...
      }
      close $fh;

      (ditto for $file2)
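
      The same goes for the output file: instead of tying @content to $file3 and pushing onto it, one could simply (just a sketch) open it for writing and print:

      open my $out_fh, ">", $file3 or die "Couldn't open '$file3': $!";
      # ... inside the loop over file 2:
      print {$out_fh} "$record_new\n";
      # ... after the loop:
      close $out_fh;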