Re^15: list of unique strings, also eliminating matching substrings

Anyway 9 seconds are not that much..

See how it works out in the real thing.

especially if I start with your algo...

index isn't my algo, just a poor, pure perl substitute for it.

If you look closely at the inline C version, longCmp() is not your average brute force string search. It uses several short circuits to avoid unnecessary comparisons.

For example, it checks the last byte in the haystack up front.

First, there is no point in comparing all 300 bytes if the last bytes don't match.
When longCmp() encounter this scenario:
```
haystack:....0ACGTACGTACGT0....
needle  :     ACGTACGc
[download]
```
The fact that the last bytes to be matched are different means it doesn't have to compare all the intermediates to discover than and can move forward to the next position.
Second, if the last byte that would be compared is 0, then not only is there no point in comparing these two, but we can skip ahead to the start of the next string in the haystack.
That is, when longCmp() tries this comparison:
```
haystack:....0ACGTACGTACGT0....
needle  :          AACCGGTT
[download]
```
It sees that the last byte that would be compared is a null, and not only skips that comparison, but skips ahead to the start of the next string:
```
haystack:....0ACGTACGTACGT0....
needle  :                  AACCGGTT
[download]
```

It isn't possible to code this kind of logic in perl efficiently, hence the inline C solution. My pure perl version was simply a poor substitute until the OP can sort out his Inline::C install.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re^15: list of unique strings, also eliminating matching substrings Select or Download Code

Replies are listed 'Best First'.
Re^16: list of unique strings, also eliminating matching substrings by LanX (Saint) on Jun 03, 2011 at 16:05 UTC
I don't get the point why continuing by comparing the last byte is better than comparing the second byte. If you want a real boost you should consider another data-structure, instead of \0 as separator encode the length of the coming snippet. like this you can avoid tests which cross borders and you can take the value as offset to the next snippet. In other words when checking if a 300 byte snippet is included within a 400 byte one you only need to check 100 possible slides. Cheers Rolf	[reply]
Re^17: list of unique strings, also eliminating matching substrings by BrowserUk (Patriarch) on Jun 03, 2011 at 16:31 UTC
I don't get the point why continuing by comparing the last byte is better than comparing the second byte. Because it detects when you can skip ahead. Ie. It exactly achieves, very efficiently, only checking 100 starts when comparing a 300 byte string against a 400 byte string. No other datastructure is required. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^18: list of unique strings, also eliminating matching substrings by LanX (Saint) on Jun 03, 2011 at 17:03 UTC
Argh ... I had already shut down my laptop when I understood what you mean ... well too late to update! :) Anyway when using C, my approach to start with the longest is for sure better and one can avoid allocation problems. But I'm sure there are already much more efficient algorithms out there (something like presorting the input in a dynamic tree) and there is no need to reinvent the wheel. And analyzing sequences is such a common task in bioinformatics, that I can hardly imagine that none of the BioPerl modules already include necessary routines. (just 50% inclusion redundancy is IMHO a very low estimation) Thats very likely an xy question, because normally the next step is to find overlapping endings to reconstruct the original DNA. Cheers Rolf	[reply]
Re^19: list of unique strings, also eliminating matching substrings by BrowserUk (Patriarch) on Jun 03, 2011 at 17:46 UTC
A reply falls below the community's threshold of quality. You may see it by logging in.