Overlapping portions of sub strings

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Overlapping portions of sub strings
by bart (Canon) on Jan 15, 2003 at 17:38 UTC

First, extract the start and end index for each range, two numbers (end >= start) for each range. Next, see if there is any overlap. That means that the start or end index of one range is inside, or touching an end, of another range, or vice versa. Actually, testing the reverse is easier: two ranges don't overlap, if the starting index of the second range comes after the end index of the first one.

Once you identified two adjoint or overlapping ranges, merge them: start is at the lowest starting index, end is at the highest end index. Be careful to restart all tests with this new range, as the new incarnation of this range now larger than it originally was, and it could overlap with ranges it didn't even touch before.

I'll try for some code, for size. First I order the ranges according to start index. Next I compare the start index for all ranges after the current one, to see if they overlap. If they do, I incorporate it into this one, delete the other one, and start all tests again for the current range. I think that should cover it.

Update: Don't use this code without checking out Thelonius' reply, first. He's right that there are some bugs. Out of historical interest, I won't fix them here.

my @range = ([35, 55], [50, 60], [30, 40], [60, 65], [70, 80]);

@range = sort { $a->[0] <=> $b->[0] } @range;

for (my $i = 0; $i < @range; $i++) {
    my $e = $range[$i][1];
    for(my $j = $i+1; $j < @range; $j++) {
        unless($range[$j][0] > $e) {
            $e = $range[$i][1] = $range[$j][1];
            splice @range, $j, 1;
            redo;
       }
   }
}

use Data::Dumper;
print Dumper \@range;
[download]

$VAR1 = [ [ 30, 65 ], [ 70, 80 ] ];
[download]

[reply]
[d/l]
[select]

Re: Re: Overlapping portions of sub strings

by Thelonius (Priest) on Jan 15, 2003 at 20:57 UTC

my @range3 = ([0, 10], [5, 20]);

redo if $j < @range;

But it is also buggy for contained intervals. The input my @range4 = ([0, 20], [5, 8]); will give the output interval ([0, 8]). Here is the complete correct code:

for (my $i = 0; $i < @range; $i++) {
  my $e = $range[$i][1];
  for(my $j = $i+1; $j < @range; $j++) {
    unless($range[$j][0] > $e) {
      if ($range[$j][1] > $e) {
        $e = $range[$i][1] = $range[$j][1];
      }
      splice @range, $j, 1;
      redo if $j < @range;
    }
  }
}
[download]

[reply]
[d/l]
[select]

Re: Re: Re: Overlapping portions of sub strings

by bart (Canon) on Jan 15, 2003 at 21:54 UTC

use List::Util qw(max);
$e = $range[$i][1] = max($e, $range[$j][1]);
[download]

But that of the infinite loop surprised me, a lot. I had assumed that the condition, the second expression in the for(;;) line, was being tested before any attempt at running the loop body, for each loop.

Read more... (3 kB)

[reply]
[d/l]
[select]

Re: Overlapping portions of sub strings
by BrowserUk (Patriarch) on Jan 16, 2003 at 00:03 UTC

I don't quite understand what you mean by "coloured portions of a string"? What color would a peice of the string be if a red range wholey contained a yellow portion? Does the yellow become red or does the red range get split into two red ranges either side of the yellow range?

Anyway, assuming that the ranges are held in a AoA, this version should work pretty quickly and is really simple.

#! perl -slw
use strict;
use Data::Dumper;

my @ranges = ( [17,19], [34,39], [26,29], [53,57], [43,47], [58,59]
             , [40,45], [30,33], [20,24], [10,15], [6,9],   [1,4],
             , [35,45], [7, 15],

);

#! initialise a string  of null as long as the string the ranges perta
+in to.
my $coal = "\0" x 100;

#! fill all the ranges with 1's
substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0]
+ + 1) for @ranges;

#print $coal;

#! Build the new AoA of ranges by scanning the string recording starts
+ and length of string of 1's
my (@new, $start);
for (0 .. length $coal) {
    my $c = substr($coal, $_, 1);
    next if not $start and $c ne '1';
    $start = $_ if $c eq '1' and not $start;
    next if $start and $c eq '1';
    push @new, [$start, $_-1];
    $start = undef;
}

print Dumper \@new;
__END__
$VAR1 = [
          [
            1,
            '4'
          ],
          [
            6,
            '15'
          ],
          [
            17,
            '24'
          ],
          [
            26,
            '47'
          ],
          [
            53,
            '59'
          ]
        ];
[download]

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

[reply]
[d/l]

Re: Re: Overlapping portions of sub strings

by matth (Monk) on Jan 16, 2003 at 10:32 UTC

This is quite hard to understand. If you have time, could you please explain this process.

[reply]

Re: Re: Re: Overlapping portions of sub strings

by BrowserUk (Patriarch) on Jan 16, 2003 at 12:15 UTC

I've attempted to structure this so that what gets displayed when you use the d/l code link at the bottom, will be a working program, slightly modified from the original. I won't know if it worked until I submit it.

Hopefully, this is a slightly clearer version than the original.

#! perl -slw
use strict;
[download]

OK, the basic mechanism is this:

Given the input ranges

my @ranges = (
    [17, 19], [34, 39], [26, 29], [53, 57], [43, 47], [58, 59], [40, 4
+5],
    [30, 33], [20, 24], [10, 15], [ 6,  9], [ 1,  4], [35, 45], [ 7, 1
+5],
);
[download]

If we sort these numerically and lay them out graphically, we can easily project the 1's down vertically and see the resultant ranges on the bottom line. That's exactly what my code is doing.

=pod comment
                  1         2         3         4         5         6
Range     123456789012345678901234567890123456789012345678901234567890
[ 1, 4]  1111
[ 6, 9]       1111
[ 7,15]        111111111
[10,15]           111111
[17,19]                  111
[20,24]                     11111
[26,29]                           1111
[30,33]                               1111
[34,39]                                   111111
[35,45]                                    11111111111
[40,45]                                         111111
[43,47]                                            11111
[53,57]                                                      11111
[58,59]                                                           11

         1111 1111111111 11111111 1111111111111111111111     1111111
Coaleced
ranges   [1-4] [6-15]     [17-24]  [26-47]                    [53-59]

#! initialise a string  of null as long as the string the ranges perta
+in to.
my $coal = "\0" x 100; #! Only need be 60 in ths example but the exces
+s bytes don't effect the results.

Using substr as an lvalue, assign a string of range[high]-range[low] x
+ '1' over the null bytes if the result string.

#! fill all the ranges with 1's
substr($coal, $_->[0], $_->[1] - $_->[0] + 1) = 1 x ($_->[1] - $_->[0]
+ + 1) for @ranges;
=cut
[download]

That might have been clearer written as


#! Only need be 60 in this example but excess bytes don't effect the r
+esults.
my $result = "\0" x 100;

use constant LOW    =>0;
use constant HIGH    =>1;
for my $range (@ranges) {
    substr( $result, $range->[LOW], $range->[HIGH] - $range->[LOW] +1 
+)
    =       '1' x                  ($range->[HIGH] - $range->[LOW] +1 
+)
}
[download]


#! Build the new AoA of ranges by scanning the string records starts a
+nd length of string of 1's
my (@results, $start);
for (0 .. length $result) { #! Looking at each character in the result
+ string in turn
    my $c = substr($result, $_, 1); #! Grab the char

    #! skip to the next char until we find a 1.
    next if not $start and $c ne '1';

    #! Record the start position if we found a 1 and we haven't alread
+y got a start.
    $start = $_ if $c eq '1' and not $start;

    #! Skip to the next char if we have start pos until we find the en
+d (ne '1')
    next if $start and $c eq '1';

    #! We have a start and we found a non-'1', so we have an end...
    #! so push the start end pair onto the results array
    push @results, [$start, $_-1];

    #! and undef $start to continue the search.
    $start = undef;
}

#! Print out the inputs sorted for human consumption, though that isn'
+t necessary of the algorithm
print map{local $"='-'; "[@$_] "} sort{ $a->[0] <=> $b->[0] } @ranges;
+  #!"
print '';

#! And the results
print map{local $"='-'; "[@$_] "}@results; #!"
print '';

__END__
c:\test>227155-3
[1-4] [6-9] [7-15] [10-15] [17-19] [20-24] [26-29] [30-33] [34-39] [35
+-45] [40-45] [43-47] [53-57] [58-59]

[1-4] [6-15] [17-24] [26-47] [53-59]


c:\test>
[download]

Note: If your string are of really extreme length and space at a premium, you could use a bit-string instead of bytes ("\0" and "1"), by using vec inplace of substr but I don't get on with vec, and my test string is short.

BTW. If you don't use one yet, the code is a lot clearer to read in a syntax highlighting editor than b&w. Though that's a matter of taste and configuration.

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

[reply]
[d/l]
[select]

Re: Re: Re: Overlapping portions of sub strings

by matth (Monk) on Jan 16, 2003 at 11:36 UTC

Sorry. The message above is addressed to BrowserUk.

[reply]

Re: Re: Overlapping portions of sub strings

by matth (Monk) on Jan 16, 2003 at 10:18 UTC

I posted the original question under �Anonymous�. What in fact I was talking about was start and end positions for genes within a genome. I wanted to merge overlapping CDS (portions of gene) regions of a genome. Maybe there are too many figures to fit in an array for your method to work with some large data sets. The method I chose was to generate a list in which each line of the list contained the start and end positions of a single CDS region. The list was first ordered by start position and then by end position. New lists were produced recursively. On each recursion, where overlapping was found between a set of two start and end positions, a single line was produced in the new list where a merged region was represented by a start and end position. The reason I said that this was not full proof was because I put a limit on the number of recursions that could take place (in order to limit the number of files generated). I guess that my method could also offer advantages in that line tagging could be used to deal with more complex merging scenarios. Maybe, for example, yellow portions of string could be merged with blue portions but not red.

Does anyone know of how best these methods can be represented mathematically?

[reply]

Re: Re: Re: Overlapping portions of sub strings

by BrowserUk (Patriarch) on Jan 16, 2003 at 16:01 UTC

The AoA's is not a fundemental part of the algorithm, The version below is nearly identical except fro it reads the offset pairs from <DATA>, one pair per line instead of from an AoA.

The only memory constraint then becomes the length of the string $result, which is defined (though not automatically detected) by the largest offset.

As the process uses just one pass of two simple loops--one to read the offsets and build the result string, one to read the results string and extract the combined offsets--it should be considerably faster than any algorithm that use multiply recursive passes. You don't even need to sort the input offsets, and it outputs the combined offsets already sorted.

Hope it proves of some use.

Slightly modified code