Warning: This post contains a lot of speculation and a little code. Basically, reading between the lines of your perl, I conclude that there is no need to performance-tune your code, because the outcome, your $permutations_total, can be approximated for any given combination of input 'start sequence' and set of regexes by performing a probability calculation.
I conclude this on the basis that you are not matching against an actual sequence, but against a large number of random permutations of that actual sequence. Therefore, you should be able to predict the outcome using statistics, without actually performing the task, which would be hugely more efficient. If that's true, read on; ignore the rest of this post if I've missed the mark again.
After giving this some thought, I think that you (and we:) are approaching this from the wrong direction. The thing I noticed was that the sample regexes you supplied all match exactly 7 characters. I'm not sure if it is always 7 characters, or if all your regexes are the same length, but the important thing is that you are probably always looking for a sequence of adjacent characters. If this is so, then you can probably speed the process up.
What you're asking the regex engine to do when you match a long string against a fixed-length regex is:
- Do the first n-chars of the string match the regex?
- Drop the first char. Add the n+1 char to the end.
- Does it match now?
- Goto step 2 for each of the 49,994 7-char substrings of your 50k-char string.
And then you repeat this process for each of the other 23 regexes. And then repeat that for each of your random permutations.
What first struck me was that rather than building 100_000 10-50k-char random strings and asking the regex engine to search each of them 24 times, you should turn the thing around. Build a random 7-char string and match that against each of the 24 regexes. Then drop the first char and tack a new random char on the end. Rinse and repeat until your requirements are satisfied.
Benchmark to test the efficiency of this approach
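A minimal sketch of what that benchmark might look like, using just two of the example patterns as stand-ins for the full set of 24 (the variable names and the anchoring with ^...$ are my own choices):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two of the example regexes, anchored so each must match the whole
# 7-char window; the real code would list all 24.
my @regexes = map { qr/^$_$/ } (
    '[NT][GCATN][NG][NC][NG][NT][NG]',
    '[NG][NT][NG][NC][NG][GCATN][NT]',
);

my @alphabet = qw( A C G T N );
my $len      = 50_000;
my $nwin     = $len - 6;    # 49,994 7-char windows in a 50,000-char string

# Start with one random 7-char window...
my $window = join '', map { $alphabet[ rand @alphabet ] } 1 .. 7;

my $hits = 0;
for ( 1 .. $nwin ) {
    for my $re (@regexes) {
        if ( $window =~ $re ) { $hits++; last }
    }
    # ...then drop the first char and tack a new random char on the end.
    substr( $window, 0, 1, '' );
    $window .= $alphabet[ rand @alphabet ];
}

print "$hits of $nwin windows matched\n";
```

Note that this never builds the 50k string at all; only the current 7-char window ever exists, which is what makes the reversal cheap.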
That seems to run pretty quickly, taking around 30 seconds to check the 49,994 7-char substrings of each randomly generated 50,000-char string against each of the 24 regexes. I'm not sure how this compares with letting the regex engine do all the work, but it seems like it might be quicker.
The next logical step...
... would be to remove the regexes from the picture completely by permuting each of them to produce each of the possible sequences that they can match and then use index on the unique ones. Eg.
The regex /[NT][GCATN][NG][NC][NG][NT][NG]/ can match (2 * 5 * 2 * 2 * 2 * 2 * 2) = 320 x 7-char sequences.
And /[NG][NT][NG][NC][NG][GCATN][NT]/ can match (2 * 2 * 2 * 2 * 2 * 5 * 2) = 320 x 7-char sequences.
However, whilst the 320 sequences are unique within each group, combining the two groups gives you 608 unique sequences rather than 640. And if you go through the process for the 4 example regexes that you provided, you end up with only 1205 unique 7-char sequences instead of 1280. Not a big saving, but worth having when you're iterating 100,000 times:).
The analysis of the regexes, and a count of the number of unique 7-char sequences they would match, can be quickly performed using this program
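A sketch of that analysis, assuming every regex is a plain run of bracketed character classes (the helper name expand is mine); run on the two example patterns above it reproduces the 608 figure:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expand a fixed-length regex made of simple character classes into
# every literal sequence it can match (the cross product of the classes).
sub expand {
    my ($pattern) = @_;
    my @classes = $pattern =~ /\[([^\]]+)\]/g;
    my @seqs = ('');
    for my $class (@classes) {
        my @chars = split //, $class;
        # Extend every partial sequence by every char in this class.
        @seqs = map { my $s = $_; map { $s . $_ } @chars } @seqs;
    }
    return @seqs;
}

my %unique;
$unique{$_}++ for map { expand($_) }
    '[NT][GCATN][NG][NC][NG][NT][NG]',
    '[NG][NT][NG][NC][NG][GCATN][NT]';

print scalar( keys %unique ), " unique 7-char sequences\n";   # 608 for these two
```

With the full unique list in hand, each window check becomes a single hash lookup (exists $unique{$window}) instead of up to 24 regex matches.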
Then I got to thinking about what you are actually measuring.
What are you actually measuring?
You're counting how many times any of the 7-byte sequences turn up in a large number of random combinations of strings that you generate with the frequency of each character weighted according to the weights of your original 'start' sequence.
What you appear to be doing is a statistical analysis of the chance that any of a given set of 7-byte sequences (as represented by your 24 regex) is likely to appear in any particular start_sequence of 50k chars.
No need to search
It was then that it struck me that you don't need to do any searching at all!
Assuming (as you are assuming:) that the random number generator is "truly random", what you are doing can be reduced to a simple(*) probability calculation.
(*) Now, I say simple, but I will admit that my math is not sufficient to work out the formula required to calculate the probability in question. I do know that there are others here who could, though. The formula should be something along the lines of:
Crass at math (especially stats:()
Given a string of 50k chars that contains the chars ('A', 'C', 'G', 'T', 'N') in the proportions (a, c, g, t, n) respectively, there are 5**7 = 78,125 possible 7-char combinations. Given the 24 regexes, which between them can match 24 * 320 = 7680 (minus a few for non-unique overlap) of those, the probability that the 50k string will contain each of those 7680(-a few)
is 1-(P/M)???
Formula wanted
Now that's a long way from being the correct formula, and I get lost in the math when you throw in the weighting factors, but I'm sure that the problem could be reduced to such a formula.
All you would need is the weights of the characters plus an analysis of the regexes and your average permutations count should be directly derivable.
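For what it's worth, here is a sketch of how one piece of that derivation might go for a single class pattern, with made-up weights (a real run would use the measured character frequencies and the de-duplicated sequences from all 24 regexes):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical character weights (the proportions a, c, g, t, n of the
# start sequence); replace with the real measured frequencies.
my %weight = ( A => 0.2, C => 0.2, G => 0.2, T => 0.2, N => 0.2 );

# One example regex as a list of its character classes.
my @classes = ( 'NT', 'GCATN', 'NG', 'NC', 'NG', 'NT', 'NG' );

# P(a random window matches this pattern) factorises into the product,
# over positions, of the summed weights of the chars in that class.
my $p = 1;
for my $class (@classes) {
    my $class_weight = 0;
    $class_weight += $weight{$_} for split //, $class;
    $p *= $class_weight;
}

my $len      = 50_000;
my $expected = ( $len - 6 ) * $p;   # overlapping windows are not truly
                                    # independent, so this is only an
                                    # approximation of the mean count
printf "P(window matches) = %.6g, expected matches = %.2f\n", $p, $expected;
```

With equal weights this gives p = 0.4**6 = 0.004096, i.e. roughly 205 expected matching windows per 50k string, which is the sort of number your 100,000 permutations would be averaging towards.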
Over to someone else for the real math, always assuming any of what I've written is applicable to your desired goal.
Sorry for the spelling, grammar and typos, but now I've got a headache, so I'm going to lie down:)
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller