Warning: This post contains a lot of speculation and a little code. Basically, reading between the lines of your perl, I conclude that there is no need to performance tune your code because the outcome, your $permutations_total can be approximated for any given combination of input 'start sequence' and set of regexes by performing a probablility calculation.

I conclude this on the basis that you are not matching against an actual sequence, but against (a large number) of random permutations of that actual sequence. Therefore, you should be able to predict the outcome using statistics, without actually performing the task, which would be hugely more efficient. If its true. Ignire the rest of this post if I missed the mark again.

After giving this some thought, I think that you (and we:) are approaching this from the wrong direction. The thing I noticed was that the sample regexes you supplied all match exactly 7 characters. I'm not sure if it is always 7 characters, or if all your regex are the same length, but the important thing is that you are probably always looking for a sequence of adjacent characters. If this is so, the you can probably speed the process up.

What your asking the regex negine to do when you match a long string against a regex that is fixed length is to say

  1. Do the first n-chars of the string match the regex?
  2. Drop the first char. Add the n+1 char to the end.
  3. Does it match now?
  4. Goto step 2 for each of the 49,993 x 7-char substrings of your 50k char string.

And then you repeat this process for each of the other 23 regexes. And then repeat that for each of your random permutations.

What first struck me was that rather than building 100_000 10-50hk random strings, and asking the regex engine to search each of these 24 times, that you should turn the thing around. Build a random 7-char string and match that against each of the 24 regexes. Then drop the first char and tack a new random char on the end. Rinse and repeat until your requirements are satisfied.

Benchmark to test the efficiency of this approach

#! perl -slw use strict; use vars qw[$N]; use Benchmark::Timer; my $T= new Benchmark::Timer; $|++; $N ||= 10; # The number of random permutations of each string to analy +se. ##my $start-seq = <STARTS>; ## No need to count frequencies just to build any array. ##my @bases = split//, $start_seq; # Simulate the above two lines for testing my @bases = map{ qw[A C G T N][ rand 5 ] } 1..50_000; my $perms_count = 0; for( 1 .. $N ) { # Pick our first random 7 char string my $sub7= join'', @bases[ map{ rand @bases } 1 .. 7 ]; my $count = @bases - 7; # No. of 7 chars string to match a +gainst. $T->start( 'search' ); while( $count-- ) { printf "\r%d", $count; # print 'Checking: ', $sub7; #Unroll the match loop $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; $perms_count++ if $sub7 =~ qr '[NT][GCATN][NG][NC][NG][NT][NG] +'; $perms_count++ if $sub7 =~ qr '[NG][NT][NG][NC][NG][GCATN][NT] +'; $perms_count++ if $sub7 =~ qr '[NA][GCATN][NC][NG][NC][NA][NC] +'; $perms_count++ if $sub7 =~ qr '[NC][NA][NC][NG][NC][GCATN][NA] +'; # Drop the first char and tack a new one on the end. $sub7 = substr( $sub7, 1 ) . $bases[ rand @bases ]; } $T->stop( 'search' ); } print 'Found: ', $perms_count, ' matches in'; $T->report;

That seems to run pretty quickly taking around 30 seconds to check the 49,997 7-chars substrings in each randomly generated 50,000 string, against each of 24 regexes. I'm not sure how this compares with letting the regex engine do all the work, but it seems like it might be quicker.

The next logical step...

... would be to remove the regexes from the picture completely by permuting each of them to produce each of the possible sequences that they can match and then use index on the unique ones. Eg.

The regex /[NT][GCATN][NG][NC][NG][NT][NG]/ can match (2 * 5 * 2 * 2 * 2 * 2 *2) = 320 x 7-chars sequences.

And /[NG][NT][NG][NC][NG][GCATN][NT]/ can match (2 * 2 * 2 * 2 * 2 * 5 * 2) = 320 x 7-char sequences.

However, whilst the 320 sequences are unique within each group, if you combine them instead of 640, you get 608. And if you go through the process for the 4 example regex that you provided instead of 1280, you end up with only 1205 unique 7-char sequences. Not a big saving, but worth having when your iterating 100,000 times:).

The analysis of the regex, and and a count of the number of unique 7-char sequences they would match can be quickly performed using this program

#! perl -slw use strict; sub p{ my @a = split '', shift; while( @_ ) { my @c= split '', shift; my @b=(); for my $c ( @c ) { push @b, $_ . $c for @a ; } @a = @b; } return @a; } my @combs = ( p( 'NT', 'GCATN', 'NG', 'NC', 'NG', 'NT', 'NG' ), p( 'NG', 'NT', 'NG', 'NC', 'NG', 'GCATN', 'NT' ), p( 'NA', 'GCATN', 'NC', 'NG', 'NC', 'NA', 'NC' ), p( 'NC', 'NA', 'NC', 'NG', 'NC', 'GCATN', 'NA' ), ); my %seen; @seen{ @combs } = (); print "The number of unique 7-char sequences that can be matched by th +e regexes is:", scalar keys %seen;

Then I got to thinking about what you are actually measuring.

What are you actually measuring?

You're counting how many times any of the 7-byte sequences turn up in a large number of random combinations of strings that you generate with the frequency of each character weighted according to the weights of your original 'start' sequence.

What you appear to be doing is a statistical analysis of the chance that any of a given set of 7-byte sequences (as represented by your 24 regex) is likely to appear in any particular start_sequence of 50k chars.

No need to search

It was then that it struck me that you don't need to do any serching at all!

Assuming (that you are assuming:), that the random number generator is "truely random", then what you are doing can be reduced to a simple(*) probability calculation.

(*)Now, I say simple, but I will admit that my math is not sufficient to work out the formula required to calculated the probabllity in question, but I do know that there are others here that could. The formula should be somethng along the lines of:

Crass at math, (especially stats:()

Given a string of 50k chars, that contain the chars ('A', 'C', 'G', 'T', 'N') in the proportions (a, c, g, t, n) respectively, then there at P= 7P5 possible 7-char combinations. Given these 24 regexes, that can match 24 * 320 = 7680 (minus a few for non-unique overlap), the probability that the 50k string will contain each of those 7680(-a few) is 1-(P/M)???

Formula wanted

Now that's a long way from being the correct formula, but I get lost in the math when you through in the weighting factors, but I'm sure that this could be reduced such a formula.

All you would need is the weights of the characters plus an analysis of the regexes and your average permutations count should be directly derivable.

Over to someone else for the real math, always assuming any of what I've written is applicable to your desired goal.

Sorry for the spelling, grammer and typos, but now I've got a headache, so I'm going to lie down:)


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller



In reply to Re: Performance Tuning: Searching Long-Sequence Permutations (Use math not code?) by BrowserUk
in thread Performance Tuning: Searching Long-Sequence Permutations by Itatsumaki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.