Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a sequence file seq.txt which gives each sequence of length 150bp at the specific position.
seq.txt position range seq 46 1..1524 ----------------- 832 1..1524 ----------------- 1008 1..1524 ----------------- 1407 1..1524 ----------------- 2360 2052..3260 ------------------ 2967 2052..3260 ------------------ 403 1..1524 ----------------- 800 1..1524 ----------------- 2986 2052..3260 ------------------ 3170 2052..3260 ------------------
I want to find out how much sequence is covered within a particular range. For example range 1...1524 is reported with 6 positions 46, 403, 800, 832, 1008, 1407 where there is an overlap between 800 and 832. The output should give (150+150+32+118+32+150+150) = 782 With my script I am trying to create an array which stores 1 to 1524. Making the array 1 if it contains the position+150 else the array is set to 0. But there is a problem with the loop I think which is not giving the correct output. Any help or idea will be appreciated. My script is:
use strict; use warnings; my $i; my $j; my $seq; open my $m, '<', 'count_try.txt' or die 'Cannot open count_try.txt'; while ($seq = <$m>) { chomp $seq; my (@range) = split(/\t/, $seq); my ($fm, $to) = split(/\.\./, $range[1]); my $r = $range[0]; for ($i = $fm; $i <= $to; $i++) { $j = $r + 150; if ($i >= $r and $i <= $j) {$range[$i] = 1;} else {$range[$i] = 0;} print $range[$i]."\t"; } } close $m;

Replies are listed 'Best First'.
Re: find the coverage of sequence in a particular range
by choroba (Cardinal) on May 20, 2016 at 12:47 UTC
    The following script gives the output you want. I'm not sure it does it in the desired way, though, as you haven't described how exactly to get the value.

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my ($START, $END, $SIZE) = (1, 1524, 150); my @range; while (my $line = <DATA>) { my ($pos, $from, $to) = split / |\.\./, $line; if ($START == $from && $END == $to) { @range[$pos .. $SIZE + $pos - 1] = (1) x $SIZE; } } say scalar grep $_, @range; __DATA__ 46 1..1524 832 1..1524 1008 1..1524 1407 1..1524 2360 2052..3260 2967 2052..3260 403 1..1524 800 1..1524 2986 2052..3260 3170 2052..3260

    UPDATE Moreover, it seems your description ("Making the array 1 if it contains the position+150 else the array is set to 0") should produce a different output, as you only want to count the part of the 1407 that overlaps 1..1524. The following code does that, and show an alternative approach, without using the array - it uses a hash to remember the positions where the state of the element would change.

    #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use Syntax::Construct qw{ // }; my ($START, $END, $SIZE) = (1, 1524, 150); my %borders; while (my $line = <DATA>) { my ($pos, $from, $to) = split / |\.\./, $line; if ($START == $from && $END == $to) { $borders{$pos}++; $borders{ $pos + $SIZE + 1 }--; } } my ($sum, $step) = (0, 0); for my $i ($START .. $END) { $step += $borders{$i} // 0; $sum++ if $step; } say $sum; __DATA__ 46 1..1524 832 1..1524 1008 1..1524 1407 1..1524 2360 2052..3260 2967 2052..3260 403 1..1524 800 1..1524 2986 2052..3260 3170 2052..3260

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: find the coverage of sequence in a particular range
by GotToBTru (Prior) on May 20, 2016 at 12:40 UTC

    The following ranges are covered:

    1. 46 - 196
    2. 403 - 553
    3. 800-950
    4. 832-982
    5. 1008-1158
    6. 1407-1524

    There is an overlap between two of the ranges. Your question is how much of the range between 1 and 1524 is covered. Do I have that correct?

    Update: solution

    use strict; use warnings; my (%covered); open my $m, '<', 'count_try.txt' or die 'Cannot open count_try.txt'; while (my $seq = <$m>) { chomp $seq; my ($start, $range) = split /\t/, $seq; my ($fm,$to) = split /\.\./,$range; my $end = $start + 149; $end = $to if $end > $to; for(my $i=$start;$i<=$end;$i++) { $covered{$range}{$i} = 1 } } for my $range (keys %covered) { printf "Coverage in range %s is %d\n",$range,scalar keys %{$covere +d{$range}} } close $m;
    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)

Re: find the coverage of sequence in a particular range
by oiskuu (Hermit) on May 20, 2016 at 20:53 UTC

    One option is to use a module that implements interval sets. CPAN has many; I can't tell which is to be recommended. Here's an example using Set::IntSpan.

    #! /usr/bin/perl use strict; use warnings; use Set::IntSpan; my $seq_len = 150; my (%seqs, %ranges); for (<>) { m/^(\d+)\s+(\d+)\.\.(\d+)/ or next; $seqs{$1} //= [ $1 +0, $1 + $seq_len-1 ]; $ranges{$2,$3} //= [ $2, $3 ]; } my $S = Set::IntSpan->new([ values %seqs ]); printf "Total seq coverage: %d\n", $S->size; for (@ranges{ sort keys %ranges }) { my $window = [ $_->[0], $_->[1] + $seq_len-1 ]; my $T = $S->intersect([ $window ]); printf "range [%d..%d] window [%d..%d] %s covers %d\n", @$_, @$window, $T->run_list, $T->size; } __END__ Total seq coverage: 1251 range [1..1524] window [0..1673] 46-195,403-552,800-981,1008-1157,14 +07-1556 covers 782 range [2052..3260] window [2051..3409] 2360-2509,2967-3135,3170-3319 + covers 469

    To me it looks like there's a bug in that module (ver 1.19): documentation promises the operands are not affected, yet intersect method appears to modify $window. Hm.

      hi... the output you show is correct. I need it that way only. I installed the module Set::IntSpan and run the script but not getting the output. script continues to run without ending and producing no results. Why is it happening. please let me know what have I done wrong. please help

        Obvious first guess: you have not provided the script with any data on stdin.