Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Do any of you know of a method by which any overlapping, adjoining or self containing start and end positions for coloured portions of a string can be merged to produce new start and end positions. I have developed a method that in practice works but in theory could allow for mistakes. ie it is not full proof. Curiosity leads me to ask how easily a full proof method can be coded.
|
|---|
| Replies are listed 'Best First'. | ||
|---|---|---|
|
Re: Overlapping portions of sub strings
by bart (Canon) on Jan 15, 2003 at 17:38 UTC | ||
First, extract the start and end index for each range, two numbers (end >= start) for each range. Next, see if there is any overlap. That means that the start or end index of one range is inside, or touching an end, of another range, or vice versa. Actually, testing the reverse is easier: two ranges don't overlap, if the starting index of the second range comes after the end index of the first one. Once you identified two adjoint or overlapping ranges, merge them: start is at the lowest starting index, end is at the highest end index. Be careful to restart all tests with this new range, as the new incarnation of this range now larger than it originally was, and it could overlap with ranges it didn't even touch before. I'll try for some code, for size. First I order the ranges according to start index. Next I compare the start index for all ranges after the current one, to see if they overlap. If they do, I incorporate it into this one, delete the other one, and start all tests again for the current range. I think that should cover it.
Result: which looks about right to me. | [reply] [d/l] [select] | |
by Thelonius (Priest) on Jan 15, 2003 at 20:57 UTC | ||
But it is also buggy for contained intervals. The input my @range4 = ([0, 20], [5, 8]); will give the output interval ([0, 8]). Here is the complete correct code:
| [reply] [d/l] [select] | |
by bart (Canon) on Jan 15, 2003 at 21:54 UTC | ||
But in fact, your code is simpler, thus, better. But that of the infinite loop surprised me, a lot. I had assumed that the condition, the second expression in the for(;;) line, was being tested before any attempt at running the loop body, for each loop. Read more... (561 Bytes)
But apparently, something goes wrong if you do a redo:
Read more... (3 kB)
| [reply] [d/l] [select] | |
|
Re: Overlapping portions of sub strings
by BrowserUk (Patriarch) on Jan 16, 2003 at 00:03 UTC | ||
I don't quite understand what you mean by "coloured portions of a string"? What color would a peice of the string be if a red range wholey contained a yellow portion? Does the yellow become red or does the red range get split into two red ranges either side of the yellow range? Anyway, assuming that the ranges are held in a AoA, this version should work pretty quickly and is really simple.
Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] | |
by matth (Monk) on Jan 16, 2003 at 10:32 UTC | ||
| [reply] | |
by BrowserUk (Patriarch) on Jan 16, 2003 at 12:15 UTC | ||
I've attempted to structure this so that what gets displayed when you use the d/l code link at the bottom, will be a working program, slightly modified from the original. I won't know if it worked until I submit it. Hopefully, this is a slightly clearer version than the original.
OK, the basic mechanism is this: Given the input ranges
If we sort these numerically and lay them out graphically, we can easily project the 1's down vertically and see the resultant ranges on the bottom line. That's exactly what my code is doing.
That might have been clearer written as Then, it's just a matter of scanning allong the string linearly and recording the starts and stops of the contguous runs of '1''s.
Note: If your string are of really extreme length and space at a premium, you could use a bit-string instead of bytes ("\0" and "1"), by using vec inplace of substr but I don't get on with vec, and my test string is short. BTW. If you don't use one yet, the code is a lot clearer to read in a syntax highlighting editor than b&w. Though that's a matter of taste and configuration. Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] [select] | |
by matth (Monk) on Jan 16, 2003 at 11:36 UTC | ||
| [reply] | |
by matth (Monk) on Jan 16, 2003 at 10:18 UTC | ||
I posted the original question under ‘Anonymous’. What in fact I was talking about was start and end positions for genes within a genome. I wanted to merge overlapping CDS (portions of gene) regions of a genome. Maybe there are too many figures to fit in an array for your method to work with some large data sets. The method I chose was to generate a list in which each line of the list contained the start and end positions of a single CDS region. The list was first ordered by start position and then by end position. New lists were produced recursively. On each recursion, where overlapping was found between a set of two start and end positions, a single line was produced in the new list where a merged region was represented by a start and end position. The reason I said that this was not full proof was because I put a limit on the number of recursions that could take place (in order to limit the number of files generated). I guess that my method could also offer advantages in that line tagging could be used to deal with more complex merging scenarios. Maybe, for example, yellow portions of string could be merged with blue portions but not red. Does anyone know of how best these methods can be represented mathematically? | [reply] | |
by BrowserUk (Patriarch) on Jan 16, 2003 at 16:01 UTC | ||
The AoA's is not a fundemental part of the algorithm, The version below is nearly identical except fro it reads the offset pairs from <DATA>, one pair per line instead of from an AoA. The only memory constraint then becomes the length of the string $result, which is defined (though not automatically detected) by the largest offset. As the process uses just one pass of two simple loops--one to read the offsets and build the result string, one to read the results string and extract the combined offsets--it should be considerably faster than any algorithm that use multiply recursive passes. You don't even need to sort the input offsets, and it outputs the combined offsets already sorted. Hope it proves of some use. Slightly modified code Read more... (925 Bytes)
Examine what is said, not who speaks. The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead. | [reply] [d/l] | |