in reply to Re^15: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

I don't see why continuing by comparing the last byte is better than comparing the second byte.

If you want a real boost you should consider a different data structure: instead of using \0 as a separator, encode the length of the following snippet. That way you can avoid tests that cross snippet borders, and you can use the length value as the offset to the next snippet.

In other words, when checking whether a 300-byte snippet is contained in a 400-byte one, you only need to test 101 possible offsets (400 - 300 + 1).
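Rolf's layout isn't shown in the thread; the following is only a minimal C sketch of the idea, with illustrative names (`pack`, `count_snippets`) and an arbitrarily chosen uint32 prefix:

```c
#include <stdint.h>
#include <string.h>

/* Pack snippets as [uint32 length][bytes]... (no \0 terminators).
   Returns the total number of bytes written to buf. */
static size_t pack(uint8_t *buf, const char **snips, size_t n) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t len = (uint32_t)strlen(snips[i]);
        memcpy(buf + off, &len, sizeof len);   /* length prefix */
        off += sizeof len;
        memcpy(buf + off, snips[i], len);      /* raw snippet bytes */
        off += len;
    }
    return off;
}

/* Walk the packed buffer: each prefix doubles as the offset to the
   next snippet, so no byte-by-byte scan for a separator is needed.
   Returns the number of snippets found. */
static size_t count_snippets(const uint8_t *buf, size_t total) {
    size_t off = 0, n = 0;
    while (off + sizeof(uint32_t) <= total) {
        uint32_t len;
        memcpy(&len, buf + off, sizeof len);
        off += sizeof len + len;   /* jump straight to the next prefix */
        n++;
    }
    return n;
}
```

Because every snippet's extent is known up front, a search can be confined to one snippet at a time and never tests a candidate that would cross a border.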

Cheers Rolf


Re^17: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on Jun 03, 2011 at 16:31 UTC
    I don't see why continuing by comparing the last byte is better than comparing the second byte.

    Because it detects when you can skip ahead. I.e. it achieves exactly that, very efficiently: only the 101 possible start positions (400 - 300 + 1) are ever checked when comparing a 300-byte string against a 400-byte string. No other data structure is required.
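    BrowserUk's actual code isn't quoted in the thread; this is only a minimal C sketch (the name `contains` is illustrative) of why checking the last byte first examines no more than hlen - nlen + 1 starts:

```c
#include <stddef.h>
#include <string.h>

/* Does needle (nlen bytes) occur anywhere in hay (hlen bytes)?
   A match starting at i has its last byte at i + nlen - 1, which must
   lie inside hay, so only starts 0 .. hlen - nlen are possible:
   hlen - nlen + 1 of them. Testing the last byte first rejects most
   starts after a single comparison. */
static int contains(const unsigned char *hay, size_t hlen,
                    const unsigned char *needle, size_t nlen) {
    if (nlen == 0 || nlen > hlen) return nlen == 0;
    unsigned char last = needle[nlen - 1];
    for (size_t i = 0; i + nlen <= hlen; i++) {  /* hlen-nlen+1 starts */
        if (hay[i + nlen - 1] != last) continue; /* cheap rejection */
        if (memcmp(hay + i, needle, nlen) == 0) return 1;
    }
    return 0;
}
```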


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Argh ... I had already shut down my laptop when I understood what you meant ... well, too late to update! :)

      Anyway, when using C, my approach of starting with the longest string is certainly better, and one can avoid allocation problems.

      But I'm sure there are already much more efficient algorithms out there (something like presorting the input into a dynamic tree), and there is no need to reinvent the wheel.
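      For context, the task under discussion (drop every string that is contained in a longer one) can be sketched naively in C; this is an O(n²·m) illustration with made-up names (`drop_substrings`), not BrowserUk's optimised solution:

```c
#include <stdlib.h>
#include <string.h>

/* Sort longest-first so every potential container is seen before its
   potential substrings. */
static int by_len_desc(const void *a, const void *b) {
    size_t la = strlen(*(const char *const *)a);
    size_t lb = strlen(*(const char *const *)b);
    return (la < lb) - (la > lb);
}

/* Keeps only strings that are not substrings (including duplicates)
   of an already-kept string. Sorts strs in place, moves survivors to
   the front, and returns how many survive. */
static size_t drop_substrings(const char **strs, size_t n) {
    qsort(strs, n, sizeof *strs, by_len_desc);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        size_t j;
        for (j = 0; j < kept; j++)
            if (strstr(strs[j], strs[i]) != NULL) break;  /* contained */
        if (j == kept) strs[kept++] = strs[i];
    }
    return kept;
}
```

Starting with the longest string, as suggested above, is what lets a single containment pass suffice.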

      And analysing sequences is such a common task in bioinformatics that I can hardly imagine that none of the BioPerl modules already includes the necessary routines. (Just 50% inclusion redundancy is, IMHO, a very low estimate.)

      That's very likely an XY question, because normally the next step is to find overlapping endings in order to reconstruct the original DNA.

      Cheers Rolf

        But I'm sure there are already much more efficient algorithms out there

        No, there aren't.

        I can hardly imagine that none of the BioPerl modules already include necessary routines

        You've obviously never tried using BioPerl modules. Every one I've tried has been over-engineered, clumsy and dog slow. And those are the good ones; many are not.

        And I'll make you a bet you won't take up.

        I bet that you cannot post a BioPerl solution to the OP's problem exactly as stated.

        Much less one that will perform better than my solution.

        I'm sure ... no need to reinvent the wheel. I can hardly imagine ... IMHO ... very likely

        Based upon your extensive experience of all things bioinformatic, and your uncanny ability to second-guess the true nature of the OP's task, I take it.

        IMO, you are wrong on all counts. And until the OP comes and sets the record straight, that is what I will choose to believe.

        Just as I will continue to enjoy solving the problems OPs post. Because I enjoy the challenge.

        For the record: I find your "if, but, maybe XY, someone else has probably already solved it if only you changed the problem to suit their solution" attitude arrogant, patronising and completely unhelpful. Oh, and decidedly off-topic.

