sesemin has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have got two files like the example below. (one with two columns and one with four columns). I want to find the common elements of the two files if the first and second column of the second file match with the first one and also if third col ==1 and fourth col >=3.

I wrote the following code but it is not very efficient. It takes forever to make comparisons because of too much loops and conditions.

Any suggestion is appreciated.

Pedro

FILE1: CLS_S3_Contig2719-591_592 1 CLS_S3_Contig2720-784_785 1 CLS_S3_Contig2721-139_140 1 CLS_S3_Contig2722-387_388 1 CLS_S3_Contig2724-557_560 2 CLS_S3_Contig2725-465_466 1 CLS_S3_Contig2726-627_650 12 CLPX6160.b1_O03.ab1-229_232 2 CLPX6260.b1_H05.ab1-511_512 1 CLPX627.b1_E14.ab1-373_398 13 CLPX6271.b1_N07.ab1-85_86 1 . . . FILE2 CLS_S3_Contig1000 82 1 0 CLS_S3_Contig1000 83 1 0 CLS_S3_Contig1000 84 1 0 CLS_S3_Contig1000 85 1 0 CLS_S3_Contig1000 86 1 5 CLS_S3_Contig1000 87 1 0 CLS_S3_Contig1000 88 1 0 CLS_S3_Contig1000 89 1 0 CLS_S3_Contig1000 90 1 8 CLS_S3_Contig1000 91 1 0 CLS_S3_Contig1000 92 1 0 CLS_S3_Contig1000 93 0 0 CLS_S3_Contig1000 94 0 0 CLS_S3_Contig1000 95 0 9 CLS_S3_Contig1000 96 0 0 CLS_S3_Contig1000 97 0 0 CLS_S3_Contig1000 98 0 0 CLS_S3_Contig1000 99 1 0 CLS_S3_Contig1000 100 1 0 CLS_S3_Contig1000 101 1 0 CLS_S3_Contig1000 102 1 0 CLS_S3_Contig1000 103 1 3 CLS_S3_Contig1000 104 1 0 CLS_S3_Contig1000 105 1 0 . . .
################################################################ # Read the first file, break the first col to its components # # Expand the last two last numbers e.g. (591_592) plus/minus 8 # # Make a hash of multiple value for each key # # Print the numner of lines read and put into a variable # ################################################################ my %file1=(); while(<INPUT1>){ chomp; (my $id, my $number) = split("\t", $_); if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9] ++)$/i) { my $matched_id=$id; # breaks the CLS_Contig1000_200-202 +to its componenents # and expands the second col plus mi +nus 8 for (my $i=$3-8;$i<$5+8;$i++){ print join ("\t", $1, $i), "\n"; push (@{$file1{$1}}, $i); #make a hash of array } } } # Count the numnber of lines minus header line my $counter_1 = `wc -l < $ARGV[0]`; die "wc failed: $?" if $?; chomp($counter_1); my $counter = $counter_1 -1; #First file has a header row print "$counter lines read from $ARGV[0] file\n"; close(INPUT1); ########################################################### # Reading the Second file # ########################################################### print "Reading the 2nd file\n"; print "It may take a while, please wait...\n"; print "-----------------------------------\n"; while(<INPUT2>){ chomp; my @current_line = split /\t/; foreach my $key (sort keys %file1){ foreach my $position1 (@{$file1{$key}}){ if ($current_line[0] eq $key) { if ($current_line[1] == $position1) { if ($current_line[2] ==1) { if ($current_line[3] >= 3) { print join ("\t", $current_line[0],$current_line[1],$current +_line[2],$current_line[3], "***",$key, $position1), "\n"; } } } } } } } close (INPUT2);

Replies are listed 'Best First'.
Re: Reading two files, cmp certain cols
by jethro (Monsignor) on Sep 19, 2008 at 02:11 UTC

    You created a hash %file1, but then didn't use it as a hash

    # foreach my $key (sort keys %file1){ # foreach my $position1 (@{$file1{$key}}){ # if ($current_line[0] eq $key) { if (exists $file1{current_line[0]} ) { $key= $current_line[0]; foreach my $position1 (@{$file1{$key}}){

    Both versions above should be equivalent (except for the sorting of the keys), but the second version removes a loop.

    The conditions don't contribute much to the runtime, no need to optimize them

      Thanks Jethro,

      I may ask more questions form our Monks as I am trying to complete this script.

      It ran very quick. You are right; the loops had nothing to do with the run time. A+++

        Hi Jethro,

        If I want to add a simple counter to count the number of lines in the output (print statement) do you know what it correct way. I pushed one of the variables into an array and counted used the scalar as a counter. But I think there should be a better way.

        while(<INPUT2>){ chomp; my @current_line = split /\t/; push (@{$file2{$current_line[1]}}, $current_line[2]); if (exists $file1{$current_line[1]} ) { my $key = $current_line[1]; foreach my $position1 (@{$file1{$key}}){ if ($current_line[1] eq $key) { if ($current_line[2] == $position1) { if ($current_line[5] ==1) { if ($current_line[14] >= 3){ print RESULTS join ("\t", $current_line[1],$current_lin +e[2],$current_line[5],$current_line[14], "***",$key, $position1), "\n +"; push (@true_positives, $current_line[1]); } } } } } } }
        I think the problem lies in the 2 loops you have set up Pedro.

        foreach my $position1 (@{$file1{$key}}){

        This array holds the position +/- 8 that you read from the first file. Then, further down in your code, you are also adding plus/minus 8 postions from the position you read from the second file.

        foreach my $pos (@range){

        So I believe that is why you are getting the redundant printout.

        Chris

Re: Reading two files, cmp certain cols
by mscharrer (Hermit) on Sep 19, 2008 at 09:02 UTC
    Hi, could you please use <readmore> tags for code of this length. Also consider to use perltidy to format your code.

    The previous poster already gave you a fix, so I just want to give you a little advice:
    Your construct:

    if ($current_line[0] eq $key) { if ($current_line[1] == $position1) { if ($current_line[2] ==1) { if ($current_line[3] >= 3) { print join ("\t", $current_line[0],$current_line[1],$current +_line[2],$current_line[3], "***",$key, $position1), "\n"; } } } }
    is much tidier written this way:
    if ( $current_line[0] eq $key && $current_line[1] == $position1 && $current_line[2] == 1 && $current_line[3] >= 3) { print join ("\t", @current_line[0..3], "***", $key, $position1), " +\n"; }
Re: Reading two files, cmp certain cols
by Cristoforo (Curate) on Sep 19, 2008 at 16:15 UTC
    I might suggest the reformatted code.

    my %file1; while(<INPUT1>){ chomp; (my $id, my $number) = split /\t/; if ($id=~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) + { $file1{$1} = [$3-8, $5+7]; # or should it be $5+8? } } print "Processed $. lines from $ARGV[0] file\n"; close(INPUT1);

    This method only saves the low and high ends of the range instead of each number in the range. Then, in your second while loop, you only need to check if $current_line[1] falls in between the low and high ends of the range. Your code reads each value until it either finds a match or not. You could eliminate all those unnecessary comparisons and just do 1 comparison instead.

    The second while loop could be written like below

    while(<INPUT2>){ chomp; my @current_line = split /\t/; # eliminate unqualified lines early next unless $current_line[2] ==1 && $current_line[3] >= 3; if ($file1{ $current_line[0]}) { my ($lo, $hi) = @{ $file1{ $current_line[0] } }; if ($lo <= $current_line[1] && $current_line[1] <= $hi) { print join("\t", $_, "***", $current_line[1]), "\n"; } } } close (INPUT2);

    In your input sample for file 2, only 3 of 24 lines qualified as valid, (column 3 == 1 and column 4 >= 3). Why not disqualify the 21 out of 24 lines before doing the other check (to see if column 1 is between the low and high ends of the range). That eliminates checking data (in your code) on lines that won't qualify anyway.

    # eliminate unqualified lines early next unless $current_line[2] ==1 && $current_line[3] >= 3;
      Hi Chris, I tried to use your code, first one that reads file1 and make the hash. Does not work very well. I think it does not make hash with multiple values. Because changing the -8 +8 does not change any thing for the $lo or $hi in the second part. Probably a hash of array would work.
      my %file1=(); while(<INPUT1>){ chomp; (my $id, my $number) = split("\t", $_); if ($id=~ m/(^CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9] ++)$/i) { #for (my $i=$3-8; $i<=$5+8; $i++){ # print join ("\t", $1, $i), "\n"; # push (@{$file1{$1}}, $i); $file1{$1} = [$3-8, $5+8]; } } #} foreach (my($k, $v) = (sort keys %file1)){ print "$k\t$v\n"; }
      Results
      CLS_S3_Contig1000 CLS_S3_Contig10000 CLS_S3_Contig1000 CLS_S3_Contig10000
      I should have use DATA::DUMPER Got it
      Thank you very much Cirstoforo for the lessons.

      I will use the tag for long codes in future.

      The reason that I needed to have all values in the first hash (%file1), is that I want to do a series of calculation. such as:

      1- For each key-value in %file1 (each key has multiple values) check and see if the key-value exist in File2 with two conditions Conting_id 2 ==1 and Contig_id 3 >=3. This is the thing that we are doing in second while loop. These are true positives.

      2- Now I want to calculate False Positives which is a little bit trickier. If I have 512 common Current_line[0] between file1 and file2. How many mistakenly in %File1 have positively identified. That is they are in the areas that either current_line2 != 1 or current_line3 is <= 3.

      3- Now False Negatives, how many of current_line[0] and current_line1 (%file1) from file1 have not identified while in the file 2 they have current_line2 == 1 and current_line3 is >= 3.

      4- Also, true negatives. How many of current_line2 == 0 and current_line3 is <= 3 have not truly identified by %file1. I came of the following code that needs to be corrected.

        I will use the tag for long codes in future.
        You do know that you can edit your own nodes, don't you? Just visit the node and edit the contents of the textbox.
      Dear Cristoforo, When I ran the script I realized that I am too stringent about my selection criteria.

      In the second file if $current_line3 >= 3 then the I need to fake around plus/minus 8 of the corresponding position to this line which is ($current_line1). So the condition would be find from file 2 those that $current_line3 >= 3 and it is OK to that ($current_line1)plus minus 8 matches with file 1.

      I thought if I put($current_line1)+/- 8 in an array. then say if from file1 $position1 exists in this array it is OK.

      If I am not clear please let me know to explain it more.

      while(<INPUT2>){ chomp; my @current_line = split /\t/; # eliminate unqualified lines early next unless $current_line[2] == 1 && $current_line[3] >= 3; #$from = $current_line[1]-8; #$to = $current_line[1]+8; # for ($from .. $to){ # push (@range, $_); # } #} if ($file1{ $current_line[0]}) { ($from, $to) = @{ $file1{ $current_line[0] } }; if ($from <= $current_line[1] && $current_line[1] <= $ +to) { print join("\t", $_, "***",$current_line[1]), "\n"; $true_positives++; } } }