in reply to counting the number of 16384 pattern matches in a large DNA sequence

getting the count of 16384 patterns in a large string(sequence)

You say "16384 patterns", but you do not show us what those patterns look like?

You say "in a large sequence", but you are processing a hash of sequences. How many sequences are in the hash? How long are they?

You say "the code is very slow", without saying how slow, or what order of improvement you are seeking.

You are slowing your code down by chomping inside the inner loop, which will be redundant after the first time.

You are also studying the sequences -- a slow process, that may or may not be benefitting you -- inside the inner loop; which means that you will study each sequence once for each pattern. There is no benefit to studying a string more than once.

If you are serious about improving the performance of this code, you will need to supply considerably more information. A few sample patterns and a sample sequence -- are they just ACGT; or do they contains the full range of nucleic acid codes ACGTURYKMSWBDHVNX-; or the full range of amino acid codes: ABCDEFGHIJKLMNOPQRSTUVWYZX*-; ?

The better teh info you supply, the more likely that you will get an effective solution.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

  • Comment on Re: counting the number of 16384 pattern matches in a large DNA sequence

Replies are listed 'Best First'.
Re^2: counting the number of 16384 pattern matches in a large DNA sequence
by AnomalousMonk (Archbishop) on Jun 14, 2012 at 18:22 UTC
    There is no benefit to studying a string more than once.

    Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?

    Occasionally and informally, I have Benchmark-ed a use of study that I thought might be beneficial per the description in the docs: matching many regexes against a single, unchanging string. I have never seen any benefit. (I must admit I have never tested the specific example given in the perlfunc docs for which a potential "big win" is claimed.) Has regex optimization reached a point at which we can say "... I ain't gonna study [strings] no more"?

      Just as a matter of curiosity, have you ever come across a case in which there was a benefit from studying a string even once?

      Yes. But it was

      • a while ago on much slower hardware than I now run.

        Boyer-Moore worked best in the days before the chip manufacturers had built string search algorithms into microcode; and before the advent of deeply pipelined architectures with branch predictions, when branching had a proportionally significantly greater impact on throughput than with modern hardware.

      • the needle being search for contained a character that was very rare in the haystack being searched.

        This allows the algorithm to trade intelligence for brute force.

        With modern processors that overlap fetches and branching stalls with other parts of the opcodes, the benefits are greatly curtailed.

      I tried to look up a old post of mine that benchmarked study having a beneficial effect, but couldn't find it.

      I also tried to replicate my memory of what I was doing back then, but cannot reproduce it. Whether that's because my memory is bad, or my modern processor negates the benefits I'm not totally sure, but I suspect the latter.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

Re^2: counting the number of 16384 pattern matches in a large DNA sequence
by Anonymous Monk on Jun 14, 2012 at 19:08 UTC

    As of Perl 5.16.0: "study is now a no-op, presumably fixing all outstanding bugs related to study causing regex matches to behave incorrectly!"

    perldelta: Other Notable Fixes

    It's a shame that it's not mentioned in the POD for study directly, but really not much of a shame to see the function fade away. The few times I attempted to use it, benchmarking showed that I still hadn't found a good application for it.

      but really not much of a shame to see the function fade away

      Agreed. Whilst Boyer-Moore (and some of the others: Aho-Corasick; Rabin-Karp; etc.) can be beneficial for specific tasks and algorithms -- spam filters; virus detection -- they rarely live up to their promise for general purpose work. As such, the complications they add to the runtime aren't worth it for the rare occasions when they are beneficial. Especially on modern hardware with big caches and pipelines.

      For genomic purposes (with its often very small alphabet), it is quite trivial to setup a specialist indexing mechanism that shows far greater benefits.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?