Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Do any of you know of a method by which any overlapping, adjoining or self containing start and end positions for coloured portions of a string can be merged to produce new start and end positions. I have developed a method that in practice works but in theory could allow for mistakes. ie it is not full proof. Curiosity leads me to ask how easily a full proof method can be coded.

Replies are listed 'Best First'.
Re: Overlapping portions of sub strings
by bart (Canon) on Jan 15, 2003 at 17:38 UTC
    You're talking about ranges, I suppose. I assume all colored ranges actually have the same color. If not, you'll have to work out a more elaborate scheme. (Instead of just colors, it's like having italic and bold ranges, and ranges that are both bold and italic.)

    First, extract the start and end index for each range, two numbers (end >= start) for each range. Next, see if there is any overlap. That means that the start or end index of one range is inside, or touching an end, of another range, or vice versa. Actually, testing the reverse is easier: two ranges don't overlap, if the starting index of the second range comes after the end index of the first one.

    Once you identified two adjoint or overlapping ranges, merge them: start is at the lowest starting index, end is at the highest end index. Be careful to restart all tests with this new range, as the new incarnation of this range now larger than it originally was, and it could overlap with ranges it didn't even touch before.

    I'll try for some code, for size. First I order the ranges according to start index. Next I compare the start index for all ranges after the current one, to see if they overlap. If they do, I incorporate it into this one, delete the other one, and start all tests again for the current range. I think that should cover it.

    Update: Don't use this code without checking out Thelonius' reply, first. He's right that there are some bugs. Out of historical interest, I won't fix them here.

    my @range = ([35, 55], [50, 60], [30, 40], [60, 65], [70, 80]); @range = sort { $a->[0] <=> $b->[0] } @range; for (my $i = 0; $i < @range; $i++) { my $e = $range[$i][1]; for(my $j = $i+1; $j < @range; $j++) { unless($range[$j][0] > $e) { $e = $range[$i][1] = $range[$j][1]; splice @range, $j, 1; redo; } } } use Data::Dumper; print Dumper \@range;
    Result:
    $VAR1 = [ [ 30, 65 ], [ 70, 80 ] ];
    which looks about right to me.
      No, this code will loop infinitely on many inputs, e.g. my @range3 = ([0, 10], [5, 20]); To fix that, you need redo if $j < @range;

      But it is also buggy for contained intervals. The input my @range4 = ([0, 20], [5, 8]); will give the output interval ([0, 8]). Here is the complete correct code:

      for (my $i = 0; $i < @range; $i++) { my $e = $range[$i][1]; for(my $j = $i+1; $j < @range; $j++) { unless($range[$j][0] > $e) { if ($range[$j][1] > $e) { $e = $range[$i][1] = $range[$j][1]; } splice @range, $j, 1; redo if $j < @range; } } }
        Apparently you're right on both accounts. That of the nested range is a stupid mistake of mine, thinking that the end of the second range would always be the larger one because its start is. In fact, what I had in my head before I wrote anything down, was using max():
        use List::Util qw(max); $e = $range[$i][1] = max($e, $range[$j][1]);
        But in fact, your code is simpler, thus, better.

        But that of the infinite loop surprised me, a lot. I had assumed that the condition, the second expression in the for(;;) line, was being tested before any attempt at running the loop body, for each loop.

        But apparently, something goes wrong if you do a redo:
Re: Overlapping portions of sub strings
by BrowserUk (Patriarch) on Jan 16, 2003 at 00:03 UTC

    I don't quite understand what you mean by "coloured portions of a string"? What color would a peice of the string be if a red range wholey contained a yellow portion? Does the yellow become red or does the red range get split into two red ranges either side of the yellow range?

    Anyway, assuming that the ranges are held in a AoA, this version should work pretty quickly and is really simple.

    #! perl -slw use strict; use Data::Dumper; my @ranges = ( [17,19], [34,39], [26,29], [53,57], [43,47], [58,59] , [40,45], [30,33], [20,24], [10,15], [6,9], [1,4], , [35,45], [7, 15], ); #! initialise a string of null as long as the string the ranges perta +in to. my $coal = "\0" x 100; #! fill all the ranges with 1's substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0] + + 1) for @ranges; #print $coal; #! Build the new AoA of ranges by scanning the string recording starts + and length of string of 1's my (@new, $start); for (0 .. length $coal) { my $c = substr($coal, $_, 1); next if not $start and $c ne '1'; $start = $_ if $c eq '1' and not $start; next if $start and $c eq '1'; push @new, [$start, $_-1]; $start = undef; } print Dumper \@new; __END__ $VAR1 = [ [ 1, '4' ], [ 6, '15' ], [ 17, '24' ], [ 26, '47' ], [ 53, '59' ] ];

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      This is quite hard to understand. If you have time, could you please explain this process.

        I've attempted to structure this so that what gets displayed when you use the d/l code link at the bottom, will be a working program, slightly modified from the original. I won't know if it worked until I submit it.

        Hopefully, this is a slightly clearer version than the original.

        #! perl -slw use strict;

        OK, the basic mechanism is this:

        Given the input ranges

        my @ranges = ( [17, 19], [34, 39], [26, 29], [53, 57], [43, 47], [58, 59], [40, 4 +5], [30, 33], [20, 24], [10, 15], [ 6, 9], [ 1, 4], [35, 45], [ 7, 1 +5], );

        If we sort these numerically and lay them out graphically, we can easily project the 1's down vertically and see the resultant ranges on the bottom line. That's exactly what my code is doing.

        =pod comment 1 2 3 4 5 6 Range 123456789012345678901234567890123456789012345678901234567890 [ 1, 4] 1111 [ 6, 9] 1111 [ 7,15] 111111111 [10,15] 111111 [17,19] 111 [20,24] 11111 [26,29] 1111 [30,33] 1111 [34,39] 111111 [35,45] 11111111111 [40,45] 111111 [43,47] 11111 [53,57] 11111 [58,59] 11 1111 1111111111 11111111 1111111111111111111111 1111111 Coaleced ranges [1-4] [6-15] [17-24] [26-47] [53-59] #! initialise a string of null as long as the string the ranges perta +in to. my $coal = "\0" x 100; #! Only need be 60 in ths example but the exces +s bytes don't effect the results. Using substr as an lvalue, assign a string of range[high]-range[low] x + '1' over the null bytes if the result string. #! fill all the ranges with 1's substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0] + + 1) for @ranges; =cut

        That might have been clearer written as

        #! Only need be 60 in this example but excess bytes don't effect the r +esults. my $result = "\0" x 100; use constant LOW =>0; use constant HIGH =>1; for my $range (@ranges) { substr( $result, $range->[LOW], $range->[HIGH] - $range->[LOW] +1 +) = '1' x ($range->[HIGH] - $range->[LOW] +1 +) }
        Then, it's just a matter of scanning allong the string linearly and recording the starts and stops of the contguous runs of '1''s.
        #! Build the new AoA of ranges by scanning the string records starts a +nd length of string of 1's my (@results, $start); for (0 .. length $result) { #! Looking at each character in the result + string in turn my $c = substr($result, $_, 1); #! Grab the char #! skip to the next char until we find a 1. next if not $start and $c ne '1'; #! Record the start position if we found a 1 and we haven't alread +y got a start. $start = $_ if $c eq '1' and not $start; #! Skip to the next char if we have start pos until we find the en +d (ne '1') next if $start and $c eq '1'; #! We have a start and we found a non-'1', so we have an end... #! so push the start end pair onto the results array push @results, [$start, $_-1]; #! and undef $start to continue the search. $start = undef; } #! Print out the inputs sorted for human consumption, though that isn' +t necessary of the algorithm print map{local $"='-'; "[@$_] "} sort{ $a->[0] <=> $b->[0] } @ranges; + #!" print ''; #! And the results print map{local $"='-'; "[@$_] "}@results; #!" print ''; __END__ c:\test>227155-3 [1-4] [6-9] [7-15] [10-15] [17-19] [20-24] [26-29] [30-33] [34-39] [35 +-45] [40-45] [43-47] [53-57] [58-59] [1-4] [6-15] [17-24] [26-47] [53-59] c:\test>

        Note: If your string are of really extreme length and space at a premium, you could use a bit-string instead of bytes ("\0" and "1"), by using vec inplace of substr but I don't get on with vec, and my test string is short.

        BTW. If you don't use one yet, the code is a lot clearer to read in a syntax highlighting editor than b&w. Though that's a matter of taste and configuration.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        Sorry. The message above is addressed to BrowserUk.

      I posted the original question under ‘Anonymous’. What in fact I was talking about was start and end positions for genes within a genome. I wanted to merge overlapping CDS (portions of gene) regions of a genome. Maybe there are too many figures to fit in an array for your method to work with some large data sets. The method I chose was to generate a list in which each line of the list contained the start and end positions of a single CDS region. The list was first ordered by start position and then by end position. New lists were produced recursively. On each recursion, where overlapping was found between a set of two start and end positions, a single line was produced in the new list where a merged region was represented by a start and end position. The reason I said that this was not full proof was because I put a limit on the number of recursions that could take place (in order to limit the number of files generated). I guess that my method could also offer advantages in that line tagging could be used to deal with more complex merging scenarios. Maybe, for example, yellow portions of string could be merged with blue portions but not red.

      Does anyone know of how best these methods can be represented mathematically?

        The AoA's is not a fundemental part of the algorithm, The version below is nearly identical except fro it reads the offset pairs from <DATA>, one pair per line instead of from an AoA.

        The only memory constraint then becomes the length of the string $result, which is defined (though not automatically detected) by the largest offset.

        As the process uses just one pass of two simple loops--one to read the offsets and build the result string, one to read the results string and extract the combined offsets--it should be considerably faster than any algorithm that use multiply recursive passes. You don't even need to sort the input offsets, and it outputs the combined offsets already sorted.

        Hope it proves of some use.

        Slightly modified code


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.