in reply to Overlapping portions of sub strings

I don't quite understand what you mean by "coloured portions of a string"? What color would a peice of the string be if a red range wholey contained a yellow portion? Does the yellow become red or does the red range get split into two red ranges either side of the yellow range?

Anyway, assuming that the ranges are held in a AoA, this version should work pretty quickly and is really simple.

#! perl -slw use strict; use Data::Dumper; my @ranges = ( [17,19], [34,39], [26,29], [53,57], [43,47], [58,59] , [40,45], [30,33], [20,24], [10,15], [6,9], [1,4], , [35,45], [7, 15], ); #! initialise a string of null as long as the string the ranges perta +in to. my $coal = "\0" x 100; #! fill all the ranges with 1's substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0] + + 1) for @ranges; #print $coal; #! Build the new AoA of ranges by scanning the string recording starts + and length of string of 1's my (@new, $start); for (0 .. length $coal) { my $c = substr($coal, $_, 1); next if not $start and $c ne '1'; $start = $_ if $c eq '1' and not $start; next if $start and $c eq '1'; push @new, [$start, $_-1]; $start = undef; } print Dumper \@new; __END__ $VAR1 = [ [ 1, '4' ], [ 6, '15' ], [ 17, '24' ], [ 26, '47' ], [ 53, '59' ] ];

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Replies are listed 'Best First'.
Re: Re: Overlapping portions of sub strings
by matth (Monk) on Jan 16, 2003 at 10:32 UTC
    This is quite hard to understand. If you have time, could you please explain this process.

      I've attempted to structure this so that what gets displayed when you use the d/l code link at the bottom, will be a working program, slightly modified from the original. I won't know if it worked until I submit it.

      Hopefully, this is a slightly clearer version than the original.

      #! perl -slw use strict;

      OK, the basic mechanism is this:

      Given the input ranges

      my @ranges = ( [17, 19], [34, 39], [26, 29], [53, 57], [43, 47], [58, 59], [40, 4 +5], [30, 33], [20, 24], [10, 15], [ 6, 9], [ 1, 4], [35, 45], [ 7, 1 +5], );

      If we sort these numerically and lay them out graphically, we can easily project the 1's down vertically and see the resultant ranges on the bottom line. That's exactly what my code is doing.

      =pod comment 1 2 3 4 5 6 Range 123456789012345678901234567890123456789012345678901234567890 [ 1, 4] 1111 [ 6, 9] 1111 [ 7,15] 111111111 [10,15] 111111 [17,19] 111 [20,24] 11111 [26,29] 1111 [30,33] 1111 [34,39] 111111 [35,45] 11111111111 [40,45] 111111 [43,47] 11111 [53,57] 11111 [58,59] 11 1111 1111111111 11111111 1111111111111111111111 1111111 Coaleced ranges [1-4] [6-15] [17-24] [26-47] [53-59] #! initialise a string of null as long as the string the ranges perta +in to. my $coal = "\0" x 100; #! Only need be 60 in ths example but the exces +s bytes don't effect the results. Using substr as an lvalue, assign a string of range[high]-range[low] x + '1' over the null bytes if the result string. #! fill all the ranges with 1's substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0] + + 1) for @ranges; =cut

      That might have been clearer written as

      #! Only need be 60 in this example but excess bytes don't effect the r +esults. my $result = "\0" x 100; use constant LOW =>0; use constant HIGH =>1; for my $range (@ranges) { substr( $result, $range->[LOW], $range->[HIGH] - $range->[LOW] +1 +) = '1' x ($range->[HIGH] - $range->[LOW] +1 +) }
      Then, it's just a matter of scanning allong the string linearly and recording the starts and stops of the contguous runs of '1''s.
      #! Build the new AoA of ranges by scanning the string records starts a +nd length of string of 1's my (@results, $start); for (0 .. length $result) { #! Looking at each character in the result + string in turn my $c = substr($result, $_, 1); #! Grab the char #! skip to the next char until we find a 1. next if not $start and $c ne '1'; #! Record the start position if we found a 1 and we haven't alread +y got a start. $start = $_ if $c eq '1' and not $start; #! Skip to the next char if we have start pos until we find the en +d (ne '1') next if $start and $c eq '1'; #! We have a start and we found a non-'1', so we have an end... #! so push the start end pair onto the results array push @results, [$start, $_-1]; #! and undef $start to continue the search. $start = undef; } #! Print out the inputs sorted for human consumption, though that isn' +t necessary of the algorithm print map{local $"='-'; "[@$_] "} sort{ $a->[0] <=> $b->[0] } @ranges; + #!" print ''; #! And the results print map{local $"='-'; "[@$_] "}@results; #!" print ''; __END__ c:\test>227155-3 [1-4] [6-9] [7-15] [10-15] [17-19] [20-24] [26-29] [30-33] [34-39] [35 +-45] [40-45] [43-47] [53-57] [58-59] [1-4] [6-15] [17-24] [26-47] [53-59] c:\test>

      Note: If your string are of really extreme length and space at a premium, you could use a bit-string instead of bytes ("\0" and "1"), by using vec inplace of substr but I don't get on with vec, and my test string is short.

      BTW. If you don't use one yet, the code is a lot clearer to read in a syntax highlighting editor than b&w. Though that's a matter of taste and configuration.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Sorry. The message above is addressed to BrowserUk.
Re: Re: Overlapping portions of sub strings
by matth (Monk) on Jan 16, 2003 at 10:18 UTC

    I posted the original question under ‘Anonymous’. What in fact I was talking about was start and end positions for genes within a genome. I wanted to merge overlapping CDS (portions of gene) regions of a genome. Maybe there are too many figures to fit in an array for your method to work with some large data sets. The method I chose was to generate a list in which each line of the list contained the start and end positions of a single CDS region. The list was first ordered by start position and then by end position. New lists were produced recursively. On each recursion, where overlapping was found between a set of two start and end positions, a single line was produced in the new list where a merged region was represented by a start and end position. The reason I said that this was not full proof was because I put a limit on the number of recursions that could take place (in order to limit the number of files generated). I guess that my method could also offer advantages in that line tagging could be used to deal with more complex merging scenarios. Maybe, for example, yellow portions of string could be merged with blue portions but not red.

    Does anyone know of how best these methods can be represented mathematically?

      The AoA's is not a fundemental part of the algorithm, The version below is nearly identical except fro it reads the offset pairs from <DATA>, one pair per line instead of from an AoA.

      The only memory constraint then becomes the length of the string $result, which is defined (though not automatically detected) by the largest offset.

      As the process uses just one pass of two simple loops--one to read the offsets and build the result string, one to read the results string and extract the combined offsets--it should be considerably faster than any algorithm that use multiply recursive passes. You don't even need to sort the input offsets, and it outputs the combined offsets already sorted.

      Hope it proves of some use.

      Slightly modified code


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.