g-alone has asked for the wisdom of the Perl Monks concerning the following question:

I have input file with these kind of data :
chr start end chr21 9411509 9411529 chr21 9411510 9411529 chr21 9411509 9411529 chr21 9411569 9411589 chr21 9411571 9411591 chr21 9411572 9411592 chr21 9411573 9411593 chr21 9411575 9411595 chr21 9411578 9411598
I want to have output which counting the number of overlapped between 1st position in start and end position with 2bp overlap : Imagine the start pos of the clusters are and end of them could have over lap with 2bp like the pic at the bottom and it counts 4
|-----| |-----| |-----| |------|
and if it does not have overlap like : do not count it and give each of them one ID which are not overlapped
|----| |----| |-----| |-----|
and get output like this
TSSD_ID Start position End position Count overlapped ID_1 XXXXX XXXXXX 54
I make the script but It's not work well :
#!/usr/bin/perl -w use strict; use warnings; my $data = <>; my $tol = 2; my @chr; my @start; my @end; my @count; my @strand; while ( $data = (<>) ) { my ( $chr, $start, $end, $info, $cnt, $strand, $start2, $end2, $color, $cnt2, $length , $zero ) = split( "\t", $data ); push( @chr, $chr); push (@start, $start); push (@end, $end); push (@strand, $strand); } my $cl1length = @start; my $cl2length = @end; print "TSSD_ID \t Start \t End \t Count\t Strand \n" ; foreach ( my $i =0 ; $i < $cl1length ; $i = $i++) { for ( my $i = 1 ; $i < $cl1length ; $i++ ) { my $ID = 1; my $count = 0; for (my $j = 0 ; $j < $cl2length ; $j++) { if ( $tol < ( $start[$i] - $start[$i+1] ) ) { $ID++; } if ($start[$i]-$start[$j] == 0) { $count++;}} print "ID_$ID \t $chr[$i] \t $start[$i] \t $end[$i] $count \t $st +rand[$i] \n"; }}
what should i do?! :(

Replies are listed 'Best First'.
Re: Help the counting!
by jethro (Monsignor) on Mar 22, 2012 at 12:08 UTC

    What is bp?

    Without exactly knowing what you want I can tell you that your script seems to have 3 loops nested into each other which will make the running time of your script unbearable for non-trivial data. And your problem doesn't seem to need that.

    How about this algorithm: Sort the ranges in ascending start position (you need to use a more complex data structure like Array-of-Arrays for this). Store the range of the first cluster into $xstart and $xend. For each new cluster (outer loop) test if the new cluster overlaps and if yes, count the overlap (inner loop) and update $xend with the end of the range

      I guess Easiest way is to show you what my data looks like in simpler way bp = basepair I have data of different chromosomes with start and end point of cluster tags in each of chromosomes it looks like :
      columns = 1: chr 2: start 3:end 4: info(X) 5: info(X) 6:strand chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 131 132 X X - I need column 1 - 2 -3 - 6 I want to count the overlappes with 2 basepair overlappes for each cor +dinates in each chromosome and give ID number for those over laps tog +ether like in chr1 there are 4 cordinates and 3 of them have overlapp +ed except the last cordinate : so I will need to first count those 3 ove +rlappes and give them ID_1 then ID_2 will give to last cordinate in chr1 which + is not have overlapped with others . then Counting is became zero for the chr2 and check inside the chr2 co +rdinates for overlap and give ID from ID_1 for chr2 again and countin +g and so on for all chromosomes and give the out put like this : TSSD_ID chr start end strand count ID_1 1 101 111 - 3 ID_2 1 112 113 - 1 ID_3 1 113 115 - 1 ID_1 2 114 118 - 1 ID_2 2 119 123 - 2 ID_1 3 125 132 - 1
      this the output I want to get but Mine is not working well !

        I think you made a mistake with chromosom 3 in your example output

        Here is a working solution:

        #!/usr/bin/perl use warnings; use strict; use Data::Dumper; my @chrom; while (<>) { my ($chr, $start,$end,$c,$d,$strand)= split; push @chrom, {'chrom'=>$chr, 'start'=>$start, 'end'=>$end, 'strand +'=>$strand }; } my @chroms= sort { $a->{start} <=> $b->{start} } @chrom; #print Dumper(\@chroms); exit(0) if (@chroms==0); my $coord= shift @chroms; #use first coordinate as range counter my $overlap=1; my $id= 1; while( my $co= shift @chroms ) { if ($co->{chrom} ne $coord->{chrom}) { printresults($coord,$overlap,$id); $coord= $co; $overlap=1; $id=1; } else { if ($co->{start}>=$coord->{end}) { printresults($coord,$overlap,$id); $coord= $co; $overlap=1; $id++; } else { $overlap++; $coord->{end}= $co->{end}; } } } printresults($coord,$overlap,$id); #------------- sub printresults { my ($coord,$overlap,$id)= @_; print "ID_$id $coord->{chrom} $coord->{start} $coor +d->{end} $coord->{strand} $overlap\n"; }
        prints
        ID_1 chr1 101 111 - 3 ID_2 chr1 112 113 - 1 ID_3 chr1 113 115 - 1 ID_1 chr2 114 118 - 1 ID_2 chr2 119 123 - 2 ID_1 chr3 125 130 - 1 ID_2 chr3 131 132 - 1

        You can remove the '#' in front of the 'print Dumper' line if you want to see how the data in @chroms looks like

Re: Counting problem!
by JavaFan (Canon) on Mar 22, 2012 at 13:47 UTC
    my %data; while (<DATA>) { my ($key, $from, $end) = split; if (!$data {$key}) { $data {$key} = [[$from, $end, 1]]; } else { my @new; my $count = 1; foreach my $segment (@{$data{$key}}) { if ($from < $$segment[1] && $end > $$segment[0]) { # # Must overlap, so merge # $from = $$segment[0] if $$segment[0] < $from; $end = $$segment[1] if $$segment[1] > $end; $count += $$segment[2]; } else { # # No overlap. Keep. # push @new, $segment; } } push @new, [$from, $end, $count]; $data{$key} = [sort {$$a[0] <=> $$b[0]} @new]; } } foreach my $key (sort keys %data) { my @segments = @{$data{$key}}; for (my $i = 0; $i < @segments; $i++) { printf "ID_%d %s %d %d %d\n", $i + 1, $key, @{$segmen +ts[$i]}; } } __DATA__ chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 131 132 X X -
    As output, I get:
    ID_1 chr1 101 111 3 ID_2 chr1 112 113 1 ID_3 chr1 113 115 1 ID_1 chr2 114 118 1 ID_2 chr2 119 123 2 ID_1 chr3 125 130 1 ID_2 chr3 131 132 1
    Note this has two ranges from chr3; not 1. I don't see anything spanning the gap from 130 to 131 in your input.
      Thank you so much for your help, There is another question that I would like to know if I want to get out put with more details what should I do, cause In some part of my data I have similar coordinates which make problem with script they look like :
      __DATA__ chr1 101 105 X X - chr1 101 105 X X - chr1 101 105 X X - chr1 101 105 X X - chr1 102 108 X X - chr1 106 111 X X - chr1 106 111 X X - chr1 112 113 X X - chr1 113 115 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 114 118 X X - chr2 119 121 X X - chr2 120 123 X X - chr3 125 130 X X - chr3 125 130 X X - chr3 131 132 X X - then I want to get output like this : output : ID_1 chr1 101 105 4 ID_1 chr1 102 106 1 ID_1 chr1 106 111 2 ID_2 chr1 112 113 1 ID_3 chr1 113 115 1 ID_1 chr2 114 118 1 ID_2 chr2 119 123 1 ID_2 chr2 119 123 1 ID_1 chr3 125 130 2 ID_2 chr3 131 132 1
      which it counts the number of similar coordinates as well and show them individually instead of total counting by merging them.
        I tried to make sense of your output, but failed. Probably the fourth column of the second row was typoed, and should read 108. But I've no idea why the first three output lines all have the same value in the first column.

        However, to count, that's trivial. Make a hash key from the primary key of your data, and just, uhm, count. You know, add one each time you see the same thing:

        $counts{$key,$start,$end}++;
Re: Help the counting!
by Anonymous Monk on Mar 22, 2012 at 11:22 UTC
      TSSD_ID Chr Start End Count + Strand ID_1 chr21 9417056 9417076 1 - ID_2 chr21 9417667 9417687 1 - ID_3 chr21 9512082 9512102 2 - ID_3 chr21 9512082 9512102 2 - ID_5 chr21 9535321 9535341 1 - ID_6 chr21 9540527 9540547 1 - ID_7 chr21 9544899 9544919 1 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_8 chr21 9564970 9564990 48 - ID_56 chr21 9575389 9575409 1 - ID_57 chr21 9652071 9652091 1 - ID_58 chr21 9671166 9671186 3 - ID_58 chr21 9671166 9671186 3 - ID_58 chr21 9671166 9671186 3 - ID_61 chr21 9700592 9700612 1 - ID_62 chr21 9704082 9704102 1 - ID_63 chr21 9705333 9705353 1 -
      if you look at the first column the ID number is jumping from 2 to 5 and from 8 to 52 as you see it adds the counting the column 5 to the ID ! but I dont want it to do that ! I want to add +1 to ID number when the entries in column three (start column) for example 1st entry - 2nd entry > 2 and add +1 to column5 when 1st entry - 2nd entry < 2 (in column 3 (start) ) I mean for entry and the next entry and give me unique output instead of printing for many times like :
      TSSD_ID Chr Start End Count + Strand ID_1 chr21 9417056 9417076 1 - ID_2 chr21 9417667 9417687 1 - ID_3 chr21 9512081 9512102 1 - ID_3 chr21 9512082 9512102 3 - ID_3 chr21 9512083 9512104 1 - ID_3 chr21 9512085 9512120 1 - ID_5 chr21 9535321 9535341 1 - ID_6 chr21 9540527 9540547 1 - ID_7 chr21 9544899 9544919 1 - ID_8 chr21 9564970 9564990 1 - ID_8 chr21 9564975 9564993 2 - ID_9 chr21 9564970 9564990 8 - ID_10 chr21 9564970 9564990 4 - ID_11 chr21 9564970 9564990 7 -
      as you see in ID_3 in one line count is 3 which I means at that position there are three identical clusters again in ID_3 there are some with counting 1 which are in that area and covered in 2bp but are not same like this :
      |=====| |=====| there are identical in ID_3 and count as 3 |=====| |=====| |=====| and these two are again in ID_3 but there are not Identical so they have different counting as each of them count as one! and when |=====| |=====| they do not have 2bp overlap the ID will change to ID_4 as you see in ID_3 all of them had 2bp overlapped
Re: Help the counting!
by Jenda (Abbot) on Mar 23, 2012 at 08:57 UTC

    I do not understand a word. Sorry to say that, but your English looks like a product of Google Translate. (No offence to Google Translate either, but when asked to translate between sufficiently different languages it still produces texts that hardly make sense.) Please write the post again, this time in your native language and I'm sure there will be someone able to understand and if they can't answer the question themselves, I'm sure they will translate your post for us to help.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.