in reply to Reading two files, cmp certain cols

I would suggest the reformatted code below.

my %file1;
while (<INPUT1>) {
    chomp;
    (my $id, my $number) = split /\t/;
    if ($id =~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
        $file1{$1} = [$3-8, $5+7]; # or should it be $5+8?
    }
}
print "Processed $. lines from $ARGV[0] file\n";
close(INPUT1);

This method saves only the low and high ends of the range instead of every number in it. Then, in your second while loop, you only need to check whether $current_line[1] falls between the low and high ends. Your code reads each stored value until it either finds a match or runs out. You could eliminate all those unnecessary comparisons and do a single range check instead.
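To make the saving concrete, here is a minimal sketch of the range idea on its own. The ID `CLS_S3_Contig1000-50_60` and the helper names `parse_range` and `in_range` are made up for illustration; the regex and the -8/+7 padding are copied from the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an ID such as "CLS_S3_Contig1000-50_60"
# into (contig, low, high), padded the same way as in the loop above.
sub parse_range {
    my ($id) = @_;
    return unless $id =~ m/^(CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i;
    return ($1, $3 - 8, $5 + 7);
}

# Hypothetical helper: one pair of comparisons replaces a scan
# over every stored position.
sub in_range {
    my ($range, $pos) = @_;
    my ($lo, $hi) = @$range;
    return $lo <= $pos && $pos <= $hi;
}

my %file1;
my ($contig, $lo, $hi) = parse_range('CLS_S3_Contig1000-50_60');
$file1{$contig} = [$lo, $hi];    # two numbers, however wide the range is

for my $pos (41, 42, 67, 68) {
    printf "%d is %s the range\n", $pos,
        in_range($file1{$contig}, $pos) ? 'inside' : 'outside';
}
```

With this sample ID the stored pair is 42 and 67, so the cost of a lookup no longer depends on how many positions the range covers.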

The second while loop could be written like this:

while (<INPUT2>) {
    chomp;
    my @current_line = split /\t/;

    # eliminate unqualified lines early
    next unless $current_line[2] == 1 && $current_line[3] >= 3;

    if ($file1{ $current_line[0] }) {
        my ($lo, $hi) = @{ $file1{ $current_line[0] } };
        if ($lo <= $current_line[1] && $current_line[1] <= $hi) {
            print join("\t", $_, "***", $current_line[1]), "\n";
        }
    }
}
close(INPUT2);

In your input sample for file 2, only 3 of 24 lines qualified as valid (column 3 == 1 and column 4 >= 3). Why not disqualify the other 21 lines before doing the other check (to see if column 2 falls between the low and high ends of the range)? That avoids checking data, as your code does, on lines that won't qualify anyway.

# eliminate unqualified lines early
next unless $current_line[2] == 1 && $current_line[3] >= 3;

Re^2: Reading two files, cmp certain cols
by sesemin (Beadle) on Sep 21, 2008 at 09:09 UTC
    Hi Chris, I tried your code, the first part that reads file1 and builds the hash. It does not work very well. I think it does not build a hash with multiple values, because changing the -8/+8 does not change anything for $lo or $hi in the second part. Probably a hash of arrays would work.
    my %file1=();
    while(<INPUT1>){
        chomp;
        (my $id, my $number) = split("\t", $_);
        if ($id=~ m/(^CLS_S3_Contig[0-9]+)([-]?)([0-9]+)([_]?)([0-9]+)$/i) {
            #for (my $i=$3-8; $i<=$5+8; $i++){
            #    print join ("\t", $1, $i), "\n";
            #    push (@{$file1{$1}}, $i);
            $file1{$1} = [$3-8, $5+8];
        }
    }
    #}
    foreach (my($k, $v) = (sort keys %file1)){
        print "$k\t$v\n";
    }
    Results
    CLS_S3_Contig1000	CLS_S3_Contig10000
    CLS_S3_Contig1000	CLS_S3_Contig10000
    I should have used Data::Dumper. Got it.
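For reference, a minimal sketch of inspecting a hash of this shape with the core `Data::Dumper` module. The contig names come from the results above, but the numbers are invented for illustration. Each value is an array reference, so it has to be dumped or dereferenced to see the low and high ends.

```perl
use strict;
use warnings;
use Data::Dumper;

# Made-up data in the same shape as %file1: contig => [low, high]
my %file1 = (
    CLS_S3_Contig1000  => [42, 68],
    CLS_S3_Contig10000 => [ 5, 30],
);

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper(\%file1);

# Or walk the hash by hand, dereferencing each array reference.
for my $k (sort keys %file1) {
    my ($lo, $hi) = @{ $file1{$k} };
    print "$k\t$lo\t$hi\n";
}
```

Iterating with `for my $k (sort keys %file1)` visits every key once, which is the usual way to list a hash's contents.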
Re^2: Reading two files, cmp certain cols
by sesemin (Beadle) on Sep 19, 2008 at 18:06 UTC
    Thank you very much, Cristoforo, for the lessons.

    I will use the tag for long codes in future.

    The reason I needed all the values in the first hash (%file1) is that I want to do a series of calculations, such as:

    1- For each key-value pair in %file1 (each key has multiple values), check whether it exists in file 2 under two conditions: $current_line[2] == 1 and $current_line[3] >= 3. This is what we are doing in the second while loop. These are the true positives.

    2- Now I want to calculate the false positives, which is a little trickier. If file 1 and file 2 have 512 values of $current_line[0] in common, how many of them has %file1 mistakenly identified as positive? That is, they fall in areas where either $current_line[2] != 1 or $current_line[3] is <= 3.

    3- Now the false negatives: how many $current_line[0] and $current_line[1] pairs have not been identified by %file1 although in file 2 they have $current_line[2] == 1 and $current_line[3] >= 3?

    4- Also, the true negatives: how many lines with $current_line[2] == 0 and $current_line[3] <= 3 have correctly not been identified by %file1? I came up with the following code, which needs to be corrected.
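Independent of the poster's own attempt, one possible reading of the four counts above can be sketched as follows. The assumptions are mine: a line counts as "predicted" when its position falls inside the file-1 range for its contig, and as "actual" when column 3 == 1 and column 4 >= 3. The helper name `classify`, the range, and the sample rows are all made up.

```perl
use strict;
use warnings;

# Classify one file-2 row against the ranges from file 1.
# Assumed reading: "predicted" = position inside the file-1 range,
# "actual" = column 3 == 1 and column 4 >= 3.
sub classify {
    my ($ranges, $contig, $pos, $flag, $score) = @_;
    my $range     = $ranges->{$contig};
    my $predicted = $range && $range->[0] <= $pos && $pos <= $range->[1];
    my $actual    = $flag == 1 && $score >= 3;
    return $predicted ? ($actual ? 'TP' : 'FP')
                      : ($actual ? 'FN' : 'TN');
}

my %file1 = (CLS_S3_Contig1000 => [42, 68]);    # made-up range
my %count = (TP => 0, FP => 0, FN => 0, TN => 0);

# Made-up rows: contig, position, column 3, column 4
my @rows = (
    ['CLS_S3_Contig1000', 50, 1, 5],
    ['CLS_S3_Contig1000', 50, 0, 5],
    ['CLS_S3_Contig1000', 90, 1, 3],
    ['CLS_S3_Contig1000', 90, 1, 1],
);
$count{ classify(\%file1, @$_) }++ for @rows;

print join("\t", map { "$_=$count{$_}" } qw(TP FP FN TN)), "\n";
```

Tallying all four outcomes in one pass over file 2 needs only the low and high ends per contig, so the range hash is still sufficient under this reading.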

      I will use the tag for long codes in future.
      You do know that you can edit your own nodes, don't you? Just visit the node and edit the contents of the textbox.
        Thanks for the tip. I thought I had used it. It seems it did not go through.
Re^2: Reading two files, cmp certain cols
by sesemin (Beadle) on Sep 20, 2008 at 22:54 UTC
    Dear Cristoforo, when I ran the script I realized that I was being too stringent with my selection criteria.

    In the second file, if $current_line[3] >= 3, then I need to allow a window of plus/minus 8 around the position corresponding to this line, which is $current_line[1]. So the condition would be: find in file 2 those lines where $current_line[3] >= 3, and it is OK if $current_line[1] plus or minus 8 matches file 1.

    I thought I could put $current_line[1] +/- 8 in an array and then, if $position1 from file 1 exists in this array, it is OK.

    If I am not clear, please let me know and I will explain it more.

    while(<INPUT2>){
        chomp;
        my @current_line = split /\t/;

        # eliminate unqualified lines early
        next unless $current_line[2] == 1 && $current_line[3] >= 3;

        #$from = $current_line[1]-8;
        #$to = $current_line[1]+8;
        # for ($from .. $to){
        #     push (@range, $_);
        # }
        #}
        if ($file1{ $current_line[0]}) {
            ($from, $to) = @{ $file1{ $current_line[0] } };
            if ($from <= $current_line[1] && $current_line[1] <= $to) {
                print join("\t", $_, "***",$current_line[1]), "\n";
                $true_positives++;
            }
        }
    }
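The plus/minus 8 tolerance described in this reply may not need an array of positions at all: checking whether a file-1 position lies within 8 of the file-2 position is a single absolute-difference comparison. A minimal sketch, with a made-up helper name `within_tolerance` and made-up numbers:

```perl
use strict;
use warnings;

# Hypothetical helper: true when $pos1 (from file 1) lies within
# $tol of $pos2 (from file 2).
sub within_tolerance {
    my ($pos1, $pos2, $tol) = @_;
    return abs($pos1 - $pos2) <= $tol;
}

for my $pos1 (91, 92, 108, 109) {
    printf "%d vs 100: %s\n", $pos1,
        within_tolerance($pos1, 100, 8) ? 'match' : 'no match';
}
```

When file 1 stores a [low, high] pair instead of a single position, the same idea becomes `$lo - $tol <= $pos2 && $pos2 <= $hi + $tol`, which widens the range by the tolerance on both sides.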