Help the counting!

Replies are listed 'Best First'.
Re: Help the counting! by jethro (Monsignor) on Mar 22, 2012 at 12:08 UTC
What is bp? Without exactly knowing what you want I can tell you that your script seems to have 3 loops nested into each other which will make the running time of your script unbearable for non-trivial data. And your problem doesn't seem to need that. How about this algorithm: Sort the ranges in ascending start position (you need to use a more complex data structure like Array-of-Arrays for this). Store the range of the first cluster into $xstart and $xend. For each new cluster (outer loop) test if the new cluster overlaps and if yes, count the overlap (inner loop) and update $xend with the end of the range	[reply]
Re^2: Help the counting! by g-alone (Initiate) on Mar 22, 2012 at 12:39 UTC
I guess Easiest way is to show you what my data looks like in simpler way bp = basepair I have data of different chromosomes with start and end point of cluster tags in each of chromosomes it looks like : columns = 1: chr 2: start 3:end 4: info(X) 5: info(X) 6:strand chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 131 132 X X - I need column 1 - 2 -3 - 6 I want to count the overlappes with 2 basepair overlappes for each cor +dinates in each chromosome and give ID number for those over laps tog +ether like in chr1 there are 4 cordinates and 3 of them have overlapp +ed except the last cordinate : so I will need to first count those 3 ove +rlappes and give them ID_1 then ID_2 will give to last cordinate in chr1 which + is not have overlapped with others . then Counting is became zero for the chr2 and check inside the chr2 co +rdinates for overlap and give ID from ID_1 for chr2 again and countin +g and so on for all chromosomes and give the out put like this : TSSD_ID chr start end strand count ID_1 1 101 111 - 3 ID_2 1 112 113 - 1 ID_3 1 113 115 - 1 ID_1 2 114 118 - 1 ID_2 2 119 123 - 2 ID_1 3 125 132 - 1 [download] this the output I want to get but Mine is not working well !	[reply] [d/l]
Re^3: Help the counting! by jethro (Monsignor) on Mar 22, 2012 at 13:48 UTC
I think you made a mistake with chromosom 3 in your example output Here is a working solution: #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my @chrom; while (<>) { my ($chr, $start,$end,$c,$d,$strand)= split; push @chrom, {'chrom'=>$chr, 'start'=>$start, 'end'=>$end, 'strand +'=>$strand }; } my @chroms= sort { $a->{start} <=> $b->{start} } @chrom; #print Dumper(\@chroms); exit(0) if (@chroms==0); my $coord= shift @chroms; #use first coordinate as range counter my $overlap=1; my $id= 1; while( my $co= shift @chroms ) { if ($co->{chrom} ne $coord->{chrom}) { printresults($coord,$overlap,$id); $coord= $co; $overlap=1; $id=1; } else { if ($co->{start}>=$coord->{end}) { printresults($coord,$overlap,$id); $coord= $co; $overlap=1; $id++; } else { $overlap++; $coord->{end}= $co->{end}; } } } printresults($coord,$overlap,$id); #------------- sub printresults { my ($coord,$overlap,$id)= @_; print "ID_$id $coord->{chrom} $coord->{start} $coor +d->{end} $coord->{strand} $overlap\n"; } [download] prints `ID_1 chr1 101 111 - 3 ID_2 chr1 112 113 - 1 ID_3 chr1 113 115 - 1 ID_1 chr2 114 118 - 1 ID_2 chr2 119 123 - 2 ID_1 chr3 125 130 - 1 ID_2 chr3 131 132 - 1` [download] You can remove the '#' in front of the 'print Dumper' line if you want to see how the data in @chroms looks like	[reply] [d/l] [select]
Re: Counting problem! by JavaFan (Canon) on Mar 22, 2012 at 13:47 UTC
my %data; while (<DATA>) { my ($key, $from, $end) = split; if (!$data {$key}) { $data {$key} = [[$from, $end, 1]]; } else { my @new; my $count = 1; foreach my $segment (@{$data{$key}}) { if ($from < $$segment[1] && $end > $$segment[0]) { # # Must overlap, so merge # $from = $$segment[0] if $$segment[0] < $from; $end = $$segment[1] if $$segment[1] > $end; $count += $$segment[2]; } else { # # No overlap. Keep. # push @new, $segment; } } push @new, [$from, $end, $count]; $data{$key} = [sort {$$a[0] <=> $$b[0]} @new]; } } foreach my $key (sort keys %data) { my @segments = @{$data{$key}}; for (my $i = 0; $i < @segments; $i++) { printf "ID_%d %s %d %d %d\n", $i + 1, $key, @{$segmen +ts[$i]}; } } __DATA__ chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 131 132 X X - [download] As output, I get: `ID_1 chr1 101 111 3 ID_2 chr1 112 113 1 ID_3 chr1 113 115 1 ID_1 chr2 114 118 1 ID_2 chr2 119 123 2 ID_1 chr3 125 130 1 ID_2 chr3 131 132 1` [download] Note this has two ranges from `chr3`; not 1. I don't see anything spanning the gap from 130 to 131 in your input.	[reply] [d/l] [select]
Re^2: Counting problem! by g-alone (Initiate) on Mar 22, 2012 at 18:48 UTC
Thank you so much for your help, There is another question that I would like to know if I want to get out put with more details what should I do, cause In some part of my data I have similar coordinates which make problem with script they look like : __DATA__ chr1 101 105 X X - chr1 101 105 X X - chr1 101 105 X X - chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 125 130 X X - chr3 131 132 X X - then I want to get output like this : output : ID_1 chr1 101 105 4 ID_1 chr1 102 106 1 ID_1 chr1 106 111 2 ID_2 chr1 112 113 1 ID_3 chr1 113 115 1 ID_1 chr2 114 118 1 ID_2 chr2 119 123 1 ID_2 chr2 119 123 1 ID_1 chr3 125 130 2 ID_2 chr3 131 132 1 [download] which it counts the number of similar coordinates as well and show them individually instead of total counting by merging them.	[reply] [d/l]
Re^3: Counting problem! by JavaFan (Canon) on Mar 22, 2012 at 19:35 UTC
I tried to make sense of your output, but failed. Probably the fourth column of the second row was typoed, and should read 108. But I've no idea why the first three output lines all have the same value in the first column. However, to count, that's trivial. Make a hash key from the primary key of your data, and just, uhm, count. You know, add one each time you see the same thing: `$counts{$key,$start,$end}++;` [download]	[reply] [d/l]
Re: Help the counting! by Anonymous Monk on Mar 22, 2012 at 11:22 UTC
what should i do?! :( Give example of output that matches your sample input, see How do I post a question effectively?	[reply]
Re^2: Help the counting! by g-alone (Initiate) on Mar 22, 2012 at 11:48 UTC
TSSD_ID Chr Start End Count + Strand ID_1 chr21 9417056 9417076 1 - ID_2 chr21 9417667 9417687 1 - ID_3 chr21 9512082 9512102 2 - ID_3 chr21 9512082 9512102 2 - ID_5 chr21 9535321 9535341 1 - ID_6 chr21 9540527 9540547 1 - ID_7 chr21 9544899 9544919 1 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_56 chr21 9575389 9575409 1 - ID_57 chr21 9652071 9652091 1 - ID_58 chr21 9671166 9671186 3 - ID_58 chr21 9671166 9671186 3 - ID_58 chr21 9671166 9671186 3 - ID_61 chr21 9700592 9700612 1 - ID_62 chr21 9704082 9704102 1 - ID_63 chr21 9705333 9705353 1 - [download] if you look at the first column the ID number is jumping from 2 to 5 and from 8 to 52 as you see it adds the counting the column 5 to the ID ! but I dont want it to do that ! I want to add +1 to ID number when the entries in column three (start column) for example 1st entry - 2nd entry > 2 and add +1 to column5 when 1st entry - 2nd entry < 2 (in column 3 (start) ) I mean for entry and the next entry and give me unique output instead of printing for many times like : `TSSD_ID Chr Start End Count + Strand ID_1 chr21 9417056 9417076 1 - ID_2 chr21 9417667 9417687 1 - ID_3 chr21 9512081 9512102 1 - ID_3 chr21 9512082 9512102 3 - ID_3 chr21 9512083 9512104 1 - ID_3 chr21 9512085 9512120 1 - ID_5 chr21 9535321 9535341 1 - ID_6 chr21 9540527 9540547 1 - ID_7 chr21 9544899 9544919 1 - ID_8 chr21 9564970 9564990 1 - ID_8 chr21 9564975 9564993 2 - ID_9 chr21 9564970 9564990 8 - ID_10 chr21 9564970 9564990 4 - ID_11 chr21 9564970 9564990 7 -` [download] as you see in ID_3 in one line count is 3 which I means at that position there are three identical clusters again in ID_3 there are some with counting 1 which are in that area and covered in 2bp but are not same like this : `\|=====\| \|=====\| there are identical in ID_3 and count as 3 \|=====\| \|=====\| \|=====\| and these two are again in ID_3 but there are not Identical so they have different counting as each of them count as one! and when \|=====\| \|=====\| they do not have 2bp overlap the ID will change to ID_4 as you see in ID_3 all of them had 2bp overlapped` [download]	[reply] [d/l] [select]
Re: Help the counting! by Jenda (Abbot) on Mar 23, 2012 at 08:57 UTC
I do not understand a word. Sorry to say that, but your English looks like a product of Google Translate. (No offence to Google Translate either, but when asked to translate between sufficiently different languages it still produces texts that hardly make sense.) Please write the post again, this time in your native language and I'm sure there will be someone able to understand and if they can't answer the question themselves, I'm sure they will translate your post for us to help. Jenda Enoch was right! Enjoy the last years of Rome.	[reply]

What is bp?

Without exactly knowing what you want I can tell you that your script seems to have 3 loops nested into each other which will make the running time of your script unbearable for non-trivial data. And your problem doesn't seem to need that.

How about this algorithm: Sort the ranges in ascending start position (you need to use a more complex data structure like Array-of-Arrays for this). Store the range of the first cluster into $xstart and $xend. For each new cluster (outer loop) test if the new cluster overlaps and if yes, count the overlap (inner loop) and update $xend with the end of the range

[reply]

columns = 1: chr  2: start 3:end 4: info(X) 5: info(X) 6:strand 

chr1    101  105   X     X     -
chr1    102  108   X     X     -
chr1    106  111   X     X     -
chr1    112  113   X     X     -
chr1    113  115   X     X     -
chr2    114  118   X     X     -  
chr2    119  121   X     X     -
chr2    120  123   X     X    -
chr3    125  130   X     X    -
chr3    131  132   X     X   -

I need column 1 - 2 -3 - 6 

I want to count the overlappes with 2 basepair overlappes for each cor
+dinates in each chromosome and give ID number for those over laps tog
+ether like in chr1 there are 4 cordinates and 3 of them have overlapp
+ed 
except the last cordinate  : so I will need to first count those 3 ove
+rlappes 
and give them ID_1 then ID_2 will give to last cordinate in chr1 which
+ is not have overlapped with others .
then Counting is became zero for the chr2 and check inside the chr2 co
+rdinates for overlap and give ID from ID_1 for chr2 again and countin
+g and so on for all chromosomes and give the out put like this :

TSSD_ID        chr       start        end      strand      count
  ID_1         1          101          111       -           3
  ID_2         1          112          113       -           1
  ID_3         1          113          115       -           1
  ID_1         2          114          118       -           1   
  ID_2         2          119          123       -           2 
  ID_1         3          125          132       -           1
[download]

[reply]
[d/l]

I think you made a mistake with chromosom 3 in your example output

Here is a working solution:

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my @chrom;

while (<>) {
    my ($chr, $start,$end,$c,$d,$strand)= split;
    push @chrom, {'chrom'=>$chr, 'start'=>$start, 'end'=>$end, 'strand
+'=>$strand };

}
my @chroms= sort { $a->{start} <=> $b->{start} } @chrom;

#print Dumper(\@chroms);

exit(0) if (@chroms==0);
my $coord= shift @chroms; #use first coordinate as range counter
my $overlap=1;
my $id= 1;

while( my $co= shift @chroms ) {
    if ($co->{chrom} ne $coord->{chrom}) {
    printresults($coord,$overlap,$id);
    $coord= $co;
    $overlap=1;
    $id=1;
    }
    else {
    if ($co->{start}>=$coord->{end}) {
        printresults($coord,$overlap,$id);
        $coord= $co;
        $overlap=1;
        $id++;
    }
    else {
        $overlap++;
        $coord->{end}= $co->{end};
    }
    }
}

printresults($coord,$overlap,$id);



#-------------

sub printresults {
    my ($coord,$overlap,$id)= @_;
    print "ID_$id       $coord->{chrom}      $coord->{start}     $coor
+d->{end}    $coord->{strand}      $overlap\n";
}
[download]

ID_1       chr1      101     111    -      3
ID_2       chr1      112     113    -      1
ID_3       chr1      113     115    -      1
ID_1       chr2      114     118    -      1
ID_2       chr2      119     123    -      2
ID_1       chr3      125     130    -      1
ID_2       chr3      131     132    -      1
[download]

You can remove the '#' in front of the 'print Dumper' line if you want to see how the data in @chroms looks like

[reply]
[d/l]
[select]

my %data;
while (<DATA>) {
    my ($key, $from, $end) = split;
    if (!$data {$key}) {
        $data {$key} = [[$from, $end, 1]];
    }
    else {
        my @new;
        my $count = 1;
        foreach my $segment (@{$data{$key}}) { 
            if ($from < $$segment[1] && $end > $$segment[0]) { 
                #
                # Must overlap, so merge
                #
                $from = $$segment[0] if $$segment[0] < $from;
                $end = $$segment[1] if $$segment[1] > $end; 
                $count += $$segment[2];
            }
            else {
                #
                # No overlap. Keep.
                #
                push @new, $segment;  
            }
        }
        push @new, [$from, $end, $count];  

        $data{$key} = [sort {$$a[0] <=> $$b[0]} @new];
    }
}

foreach my $key (sort keys %data) {
    my @segments = @{$data{$key}};
    for (my $i = 0; $i < @segments; $i++) {
        printf "ID_%d    %s   %d   %d   %d\n", $i + 1, $key, @{$segmen
+ts[$i]};
    }
}
    
__DATA__
chr1    101  105   X     X     -
chr1    102  108   X     X     -
chr1    106  111   X     X     -
chr1    112  113   X     X     -
chr1    113  115   X     X     -
chr2    114  118   X     X     -
chr2    119  121   X     X     -
chr2    120  123   X     X    -
chr3    125  130   X     X    -
chr3    131  132   X     X   -
[download]

ID_1    chr1   101   111   3
ID_2    chr1   112   113   1
ID_3    chr1   113   115   1
ID_1    chr2   114   118   1
ID_2    chr2   119   123   2
ID_1    chr3   125   130   1
ID_2    chr3   131   132   1
[download]

chr3

[reply]
[d/l]
[select]


__DATA__
chr1    101  105   X     X     -
chr1    101  105   X     X     -
chr1    101  105   X     X     -
chr1    101  105   X     X     -
chr1    102  108   X     X     -
chr1    106  111   X     X     -
chr1    106  111   X     X     -
chr1    112  113   X     X     -
chr1    113  115   X     X     -
chr2    114  118   X     X     -
chr2    114  118   X     X     -
chr2    114  118   X     X     -
chr2    114  118   X     X     -
chr2    119  121   X     X     -
chr2    120  123   X     X     -
chr3    125  130   X     X     -
chr3    125  130   X     X     -
chr3    131  132   X     X     -

then I want to get output like this :

output :

ID_1    chr1   101   105   4
ID_1    chr1   102   106   1
ID_1    chr1   106   111   2
ID_2    chr1   112   113   1
ID_3    chr1   113   115   1
ID_1    chr2   114   118   1
ID_2    chr2   119   123   1
ID_2    chr2   119   123   1
ID_1    chr3   125   130   2
ID_2    chr3   131   132   1
[download]

[reply]
[d/l]

However, to count, that's trivial. Make a hash key from the primary key of your data, and just, uhm, count. You know, add one each time you see the same thing:

    $counts{$key,$start,$end}++;
[download]

[reply]
[d/l]

what should i do?! :(

Give example of output that matches your sample input, see How do I post a question effectively?

[reply]

TSSD_ID     Chr              Start                  End      Count    
+ Strand 
ID_1      chr21      9417056      9417076 1      -  
ID_2      chr21      9417667      9417687 1      -  
ID_3      chr21      9512082      9512102 2      -  
ID_3      chr21      9512082      9512102 2      -  
ID_5      chr21      9535321      9535341 1      -  
ID_6      chr21      9540527      9540547 1      -  
ID_7      chr21      9544899      9544919 1      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      - 
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_8      chr21      9564970      9564990 48      -  
ID_56      chr21      9575389      9575409 1      -  
ID_57      chr21      9652071      9652091 1      -  
ID_58      chr21      9671166      9671186 3      -  
ID_58      chr21      9671166      9671186 3      -  
ID_58      chr21      9671166      9671186 3      -  
ID_61      chr21      9700592      9700612 1      -  
ID_62      chr21      9704082      9704102 1      -  
ID_63      chr21      9705333      9705353 1      -
[download]

TSSD_ID     Chr              Start                  End      Count    
+ Strand 
ID_1      chr21      9417056      9417076 1      -  
ID_2      chr21      9417667      9417687 1      -  
ID_3      chr21      9512081      9512102 1      -  
ID_3      chr21      9512082      9512102 3      -  
ID_3      chr21      9512083      9512104 1      -  
ID_3      chr21      9512085      9512120 1      -  
ID_5      chr21      9535321      9535341 1      -  
ID_6      chr21      9540527      9540547 1      -  
ID_7      chr21      9544899      9544919 1      -  
ID_8      chr21      9564970      9564990 1      -  
ID_8      chr21      9564975      9564993 2      -  
ID_9      chr21      9564970      9564990 8      -  
ID_10      chr21      9564970      9564990 4      -  
ID_11     chr21      9564970      9564990 7      -
[download]

|=====|
|=====|      there are identical in ID_3 and count as 3
|=====|
        |=====|
               |=====|  and these two are again 
                                 in ID_3 but there are not
                                 Identical so they have different
                           counting as each of them count as one!
and 
when 
|=====|
             |=====|    they do not have 2bp overlap
                                 the ID will change to ID_4
as you see in ID_3 all of them had 2bp overlapped
[download]

[reply]
[d/l]
[select]

I do not understand a word. Sorry to say that, but your English looks like a product of Google Translate. (No offence to Google Translate either, but when asked to translate between sufficiently different languages it still produces texts that hardly make sense.) Please write the post again, this time in your native language and I'm sure there will be someone able to understand and if they can't answer the question themselves, I'm sure they will translate your post for us to help.

Jenda
Enoch was right!
Enjoy the last years of Rome.

[reply]