Take a closer look at the original strings. The second two strings match multiple substrings, and the desired output does show this, just in a sort-of-confusing way.
Take this example:
STRING
-------
GCGCTCGACGC
SUBSTRINGS
----------
GCGC ACG == [GCGC]TCG[ACG]C
But, when we have OVERLAPPING sequences the output should 'mash-up' a bit:
STRING
-------
GCGCTCGACGC
SUBSTRINGS
----------
GCGC GCTC == [GCGCTC]GACGC
Do you see how GCGC AND GCTC MERGE into one single substring for the desired output?
So I think the algorithm should look like this:
1. Make as many straight matches as you can.
2. If a new match overlaps a part of the string that has already been matched, extend that existing match to include the new one.
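Here is a rough sketch of that idea in Python, not a definitive implementation. The function name bracket_matches is made up for illustration, and I'm assuming two things the original question doesn't spell out: every occurrence of each substring should be marked, and matches that merely touch (one ends exactly where the next begins) stay separate rather than merging. It collects every match as a (start, end) span, sorts the spans, merges the overlapping ones, and then rebuilds the string with brackets.

```python
import re

def bracket_matches(string, substrings):
    """Wrap every matched region of `string` in [ ], merging overlapping matches."""
    # Step 1: collect a (start, end) span for every occurrence of every substring.
    spans = []
    for sub in substrings:
        if not sub:
            continue  # an empty substring would match everywhere; skip it
        # The lookahead lets us find occurrences that overlap each other.
        for m in re.finditer("(?=" + re.escape(sub) + ")", string):
            spans.append((m.start(), m.start() + len(sub)))

    # Step 2: sort the spans, then fold each one into the previous span
    # whenever it starts before that span ends (assumption: touching
    # spans are NOT merged, only truly overlapping ones).
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Step 3: rebuild the string with brackets around each merged span.
    out = []
    pos = 0
    for start, end in merged:
        out.append(string[pos:start])
        out.append("[" + string[start:end] + "]")
        pos = end
    out.append(string[pos:])
    return "".join(out)

print(bracket_matches("GCGCTCGACGC", ["GCGC", "ACG"]))   # [GCGC]TCG[ACG]C
print(bracket_matches("GCGCTCGACGC", ["GCGC", "GCTC"]))  # [GCGCTC]GACGC
```

Working on spans instead of editing the string as you go means the brackets never interfere with later matches, which is where the mess tends to come from.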
Can you imagine how messy this would look if you had 100 substrings and a main string running 10,000 letters long (which I assume is possible because this stuff looks like gene sequence data)?