Reading two files, cmp certain cols

sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have got two files like the example below. (one with two columns and one with four columns). I want to find the common elements of the two files if the first and second column of the second file match with the first one and also if third col ==1 and fourth col >=3.

I wrote the following code but it is not very efficient. It takes forever to make comparisons because of too much loops and conditions.

Any suggestion is appreciated.

Pedro


FILE1:

CLS_S3_Contig2719-591_592       1
CLS_S3_Contig2720-784_785       1
CLS_S3_Contig2721-139_140       1
CLS_S3_Contig2722-387_388       1
CLS_S3_Contig2724-557_560       2
CLS_S3_Contig2725-465_466       1
CLS_S3_Contig2726-627_650       12
CLPX6160.b1_O03.ab1-229_232     2
CLPX6260.b1_H05.ab1-511_512     1
CLPX627.b1_E14.ab1-373_398      13
CLPX6271.b1_N07.ab1-85_86       1

.
.
.

FILE2

CLS_S3_Contig1000    82    1    0
CLS_S3_Contig1000    83    1    0
CLS_S3_Contig1000    84    1    0
CLS_S3_Contig1000    85    1    0
CLS_S3_Contig1000    86    1    5
CLS_S3_Contig1000    87    1    0
CLS_S3_Contig1000    88    1    0
CLS_S3_Contig1000    89    1    0
CLS_S3_Contig1000    90    1    8
CLS_S3_Contig1000    91    1    0
CLS_S3_Contig1000    92    1    0
CLS_S3_Contig1000    93    0    0
CLS_S3_Contig1000    94    0    0
CLS_S3_Contig1000    95    0    9
CLS_S3_Contig1000    96    0    0
CLS_S3_Contig1000    97    0    0
CLS_S3_Contig1000    98    0    0
CLS_S3_Contig1000    99    1    0
CLS_S3_Contig1000    100    1    0
CLS_S3_Contig1000    101    1    0
CLS_S3_Contig1000    102    1    0
CLS_S3_Contig1000    103    1    3
CLS_S3_Contig1000    104    1    0
CLS_S3_Contig1000    105    1    0
.
.
.
[download]

################################################################
# Read the first file, break the first col to its components   #
# Expand the last two last numbers e.g. (591_592) plus/minus 8 #      
# Make a hash of multiple value for each key                   #
# Print the numner of lines read and put into a variable       #
################################################################

my %file1=();
while(<INPUT1>){
         chomp;
         (my $id, my $number) = split("\t", $_);

          if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]
++)$/i) {

              my $matched_id=$id; # breaks the CLS_Contig1000_200-202 
+to its componenents
                                  # and expands the second col plus mi
+nus 8 
                  for (my $i=$3-8;$i<$5+8;$i++){
              print join ("\t", $1, $i), "\n";
              push (@{$file1{$1}}, $i); #make a hash of array
        }
   }
}

           # Count the numnber of lines minus header line
           
           my $counter_1 = `wc -l < $ARGV[0]`;
           die "wc failed: $?"
           if $?;
           chomp($counter_1);
           my $counter = $counter_1 -1; #First file has a header row 
           print "$counter lines read from  $ARGV[0] file\n";

close(INPUT1);


###########################################################
#                    Reading the Second file              #
###########################################################

print "Reading the 2nd file\n";
print "It may take a while, please wait...\n";
print "-----------------------------------\n";



while(<INPUT2>){
         chomp;
         my @current_line  = split /\t/;

        foreach my $key (sort keys %file1){
                 foreach my $position1 (@{$file1{$key}}){
            if ($current_line[0] eq $key) {
              if ($current_line[1] == $position1) {
                  if ($current_line[2] ==1) {
                      if ($current_line[3] >= 3) {

          print join ("\t", $current_line[0],$current_line[1],$current
+_line[2],$current_line[3], "***",$key, $position1), "\n";
                   }
                }
              }
           }
        }
    }
}

close (INPUT2);
[download]

Comment on Reading two files, cmp certain cols Select or Download Code

Replies are listed 'Best First'.
Re: Reading two files, cmp certain cols by jethro (Monsignor) on Sep 19, 2008 at 02:11 UTC
You created a hash %file1, but then didn't use it as a hash `# foreach my $key (sort keys %file1){ # foreach my $position1 (@{$file1{$key}}){ # if ($current_line[0] eq $key) { if (exists $file1{current_line[0]} ) { $key= $current_line[0]; foreach my $position1 (@{$file1{$key}}){` [download] Both versions above should be equivalent (except for the sorting of the keys), but the second version removes a loop. The conditions don't contribute much to the runtime, no need to optimize them	[reply] [d/l]
Re^2: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 19, 2008 at 02:27 UTC
Thanks Jethro, I may ask more questions form our Monks as I am trying to complete this script. It ran very quick. You are right; the loops had nothing to do with the run time. A+++	[reply]
Re^3: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 19, 2008 at 16:40 UTC
Hi Jethro, If I want to add a simple counter to count the number of lines in the output (print statement) do you know what it correct way. I pushed one of the variables into an array and counted used the scalar as a counter. But I think there should be a better way. while(<INPUT2>){ chomp; my @current_line = split /\t/; push (@{$file2{$current_line[1]}}, $current_line[2]); if (exists $file1{$current_line[1]} ) { my $key = $current_line[1]; foreach my $position1 (@{$file1{$key}}){ if ($current_line[1] eq $key) { if ($current_line[2] == $position1) { if ($current_line[5] ==1) { if ($current_line[14] >= 3){ print RESULTS join ("\t", $current_line[1],$current_lin +e[2],$current_line[5],$current_line[14], "***",$key, $position1), "\n +"; push (@true_positives, $current_line[1]); } } } } } } } [download]	[reply] [d/l]
Re^4: Reading two files, cmp certain cols by jethro (Monsignor) on Sep 19, 2008 at 23:39 UTC
Re^5: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 20, 2008 at 20:36 UTC
Some notes below your chosen depth have not been shown here
Re^3: Reading two files, cmp certain cols by Cristoforo (Curate) on Sep 20, 2008 at 23:17 UTC
I think the problem lies in the 2 loops you have set up Pedro. `foreach my $position1 (@{$file1{$key}}){` This array holds the position +/- 8 that you read from the first file. Then, further down in your code, you are also adding plus/minus 8 postions from the position you read from the second file. `foreach my $pos (@range){` So I believe that is why you are getting the redundant printout. Chris	[reply] [d/l] [select]
Re: Reading two files, cmp certain cols by mscharrer (Hermit) on Sep 19, 2008 at 09:02 UTC
Hi, could you please use `<readmore>` tags for code of this length. Also consider to use `perltidy` to format your code. The previous poster already gave you a fix, so I just want to give you a little advice: Your construct: `if ($current_line[0] eq $key) { if ($current_line[1] == $position1) { if ($current_line[2] ==1) { if ($current_line[3] >= 3) { print join ("\t", $current_line[0],$current_line[1],$current +_line[2],$current_line[3], "*",$key, $position1), "\n"; } } } }` [download] is much tidier written this way: `if ( $current_line[0] eq $key && $current_line[1] == $position1 && $current_line[2] == 1 && $current_line[3] >= 3) { print join ("\t", @current_line[0..3], "*", $key, $position1), " +\n"; }` [download]	[reply] [d/l] [select]
Re: Reading two files, cmp certain cols by Cristoforo (Curate) on Sep 19, 2008 at 16:15 UTC
I might suggest the reformatted code. `my %file1; while(<INPUT1>){ chomp; (my $id, my $number) = split /\t/; if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) + { $file1{$1} = [$3-8, $5+7]; # or should it be $5+8? } } print "Processed $. lines from $ARGV[0] file\n"; close(INPUT1);` [download] This method only saves the low and high ends of the range instead of each number in the range. Then, in your second `while` loop, you only need to check if `$current_line[1]` falls in between the low and high ends of the range. Your code reads each value until it either finds a match or not. You could eliminate all those unnecessary comparisons and just do 1 comparison instead. The second while loop could be written like below `while(<INPUT2>){ chomp; my @current_line = split /\t/; # eliminate unqualified lines early next unless $current_line[2] ==1 && $current_line[3] >= 3; if ($file1{ $current_line[0]}) { my ($lo, $hi) = @{ $file1{ $current_line[0] } }; if ($lo <= $current_line[1] && $current_line[1] <= $hi) { print join("\t", $_, "***", $current_line[1]), "\n"; } } } close (INPUT2);` [download] In your input sample for file 2, only 3 of 24 lines qualified as valid, (column 3 == 1 and column 4 >= 3). Why not disqualify the 21 out of 24 lines before doing the other check (to see if column 1 is between the low and high ends of the range). That eliminates checking data (in your code) on lines that won't qualify anyway. `# eliminate unqualified lines early next unless $current_line[2] ==1 && $current_line[3] >= 3;` [download]	[reply] [d/l] [select]
Re^2: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 21, 2008 at 09:09 UTC
Hi Chris, I tried to use your code, first one that reads file1 and make the hash. Does not work very well. I think it does not make hash with multiple values. Because changing the -8 +8 does not change any thing for the $lo or $hi in the second part. Probably a hash of array would work. `my %file1=(); while(<INPUT1>){ chomp; (my $id, my $number) = split("\t", $_); if ($id=~ m/(^CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9] ++)$/i) { #for (my $i=$3-8; $i<=$5+8; $i++){ # print join ("\t", $1, $i), "\n"; # push (@{$file1{$1}}, $i); $file1{$1} = [$3-8, $5+8]; } } #} foreach (my($k, $v) = (sort keys %file1)){ print "$k\t$v\n"; }` [download] Results `CLS_S3_Contig1000 CLS_S3_Contig10000 CLS_S3_Contig1000 CLS_S3_Contig10000` [download] I should have use DATA::DUMPER Got it	[reply] [d/l] [select]
Re^2: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 19, 2008 at 18:06 UTC
Thank you very much Cirstoforo for the lessons. I will use the tag for long codes in future. The reason that I needed to have all values in the first hash (%file1), is that I want to do a series of calculation. such as: 1- For each key-value in %file1 (each key has multiple values) check and see if the key-value exist in File2 with two conditions Conting_id 2 ==1 and Contig_id 3 >=3. This is the thing that we are doing in second while loop. These are true positives. 2- Now I want to calculate False Positives which is a little bit trickier. If I have 512 common Current_line[0] between file1 and file2. How many mistakenly in %File1 have positively identified. That is they are in the areas that either current_line2 != 1 or current_line3 is <= 3. 3- Now False Negatives, how many of current_line[0] and current_line1 (%file1) from file1 have not identified while in the file 2 they have current_line2 == 1 and current_line3 is >= 3. 4- Also, true negatives. How many of current_line2 == 0 and current_line3 is <= 3 have not truly identified by %file1. I came of the following code that needs to be corrected. Read more... (3 kB)	[reply] [d/l]
Re^3: Reading two files, cmp certain cols by FunkyMonk (Bishop) on Sep 19, 2008 at 20:46 UTC
I will use the tag for long codes in future. You do know that you can edit your own nodes, don't you? Just visit the node and edit the contents of the textbox.	[reply]
Re^4: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 20, 2008 at 05:43 UTC
Re^2: Reading two files, cmp certain cols by sesemin (Beadle) on Sep 20, 2008 at 22:54 UTC
Dear Cristoforo, When I ran the script I realized that I am too stringent about my selection criteria. In the second file if $current_line3 >= 3 then the I need to fake around plus/minus 8 of the corresponding position to this line which is ($current_line1). So the condition would be find from file 2 those that $current_line3 >= 3 and it is OK to that ($current_line1)plus minus 8 matches with file 1. I thought if I put($current_line1)+/- 8 in an array. then say if from file1 $position1 exists in this array it is OK. If I am not clear please let me know to explain it more. `while(<INPUT2>){ chomp; my @current_line = split /\t/; # eliminate unqualified lines early next unless $current_line[2] == 1 && $current_line[3] >= 3; #$from = $current_line[1]-8; #$to = $current_line[1]+8; # for ($from .. $to){ # push (@range, $_); # } #} if ($file1{ $current_line[0]}) { ($from, $to) = @{ $file1{ $current_line[0] } }; if ($from <= $current_line[1] && $current_line[1] <= $ +to) { print join("\t", $_, "***",$current_line[1]), "\n"; $true_positives++; } } }` [download]	[reply] [d/l]