in reply to Re^11: list of unique strings, also eliminating matching substrings
in thread list of unique strings, also eliminating matching substrings

ATM I can't see any influence of string allocation.

It's simple your version's performance is proportional to the input size, mine is proportional to output.

In rare cases where there are almost no strings to be excluded - i.e. output nearly input - my version can be slightly slower.

In other words if only 10% of all sequences remain after filtering, my algo is about 10 times faster.

BTW: using "\n" instead of chr(0) was a stupid idea, the extra byte is expensive.

Cheers Rolf

  • Comment on Re^12: list of unique strings, also eliminating matching substrings

Replies are listed 'Best First'.
Re^13: list of unique strings, also eliminating matching substrings
by BrowserUk (Patriarch) on Jun 03, 2011 at 15:20 UTC
    ATM I can't see any influence of string allocation.

    Try it with 200,000 strings where half are to be excluded, then you'll see the affects of allocating and copying 300 bytes; then allocating and copying 600 bytes & freeing 600; then allocating and copying 900 bytes & freeing 900; ... 99,995 allocs/copies/frees omitted ...; then allocating copying 29,999,700 bytes & freeing 29.9997MB; then allocating and copying 30,000,000 bytes & freeing 30MB.

    Each time you do $longest .= "\n" . $x; perl has allocate a new chuck of memory big enough to accommodate $longest + $x; then copy those two into the newly allocated space, then free both the originals. And as each freed allocation is not big enough to accommodate the next iteration of append, each new allocation (once you get past trivial amounts) requires Perl to go to the OS for a new chunk of virtual memory. And that gets very costly.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      allocating the maximum needed string right away and setting it to "" afterwards seems to help.
      DB<109> $w=""; $a="x" x 300; $t=time; $w.=$a for (1..200_000); prin +t time-$t, ":", length $w 9:60000000 DB<110> $x="x" x 60000000; $x=""; $a="x" x 300; $t=time; $x.=$a for +(1..200_000); print time-$t, ":", length $x 0:60000000

      Anyway 9 seconds are not that much...

      Calculating 200_000 on my limited system takes hours, especially if I start with your algo...maybe next night

      Cheers Rolf

        Anyway 9 seconds are not that much..

        See how it works out in the real thing.

        especially if I start with your algo...

        index isn't my algo, just a poor, pure perl substitute for it.

        If you look closely at the inline C version, longCmp() is not your average brute force string search. It uses several short circuits to avoid unnecessary comparisons.

        For example, it checks the last byte in the haystack up front.

        • First, there is no point in comparing all 300 bytes if the last bytes don't match.

          When longCmp() encounter this scenario:

          haystack:....0ACGTACGTACGT0.... needle : ACGTACGc

          The fact that the last bytes to be matched are different means it doesn't have to compare all the intermediates to discover than and can move forward to the next position.

        • Second, if the last byte that would be compared is 0, then not only is there no point in comparing these two, but we can skip ahead to the start of the next string in the haystack.

          That is, when longCmp() tries this comparison:

          haystack:....0ACGTACGTACGT0.... needle : AACCGGTT

          It sees that the last byte that would be compared is a null, and not only skips that comparison, but skips ahead to the start of the next string:

          haystack:....0ACGTACGTACGT0.... needle : AACCGGTT

        It isn't possible to code this kind of logic in perl efficiently, hence the inline C solution. My pure perl version was simply a poor substitute until the OP can sort out his Inline::C install.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.