Regex KungFu help needed

drblove27 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

So in attempting to become a totally awesome regexer, I have run into a problem that I am hoping you can assist me with.

Specifically, I have text patterns and strings in which I want to count those patterns, and I want overlapping occurrences to be counted too. I have hunted around these forums and found some good advice, but I have an additional problem. Here is what I have so far:

my $first_counter = 0;
my $second_counter = 0;

my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";

# This is what I initially did, but did not get the right answer

$first_counter++ while $sequence =~/AAAAA/g; #Here I get 3

# Because I want to count the overlapping AAAAA in the sequence,
# I hunted around and found this solution

$sequence =~ /AAAAA(?{$second_counter++})(?!)/; # This gives the right answer of 11

print "First counter: $first_counter\n";
print "Second counter: $second_counter\n";

Now what I really want to do is that instead of putting in the pattern in the search, I would like to pass it as a variable, i.e. something like:

my @real_count = (0,0,0,0);
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
my @pattern;
$pattern[0] = "AAAAA";
$pattern[1] = "GGGGG";
$pattern[2] = "GGAGA";
$pattern[3] = "GAAGG";

for (my $i=0; $i <= 3; $i++) {
    $sequence =~ /$pattern[$i](?{$real_count[$i]++})(?!)/;
}

foreach (@real_count) {
    print "$_\n";
}

When I run the above I get the following error message (which I admit I do not understand):

"Eval-group not allowed at runtime, use re 'eval' in regex m/AAAAA(?{$real_count[$i]++})(?!)/ at E:\Bioreka\Test\multimatch.pl line 13."

Can anyone help point me to how to address this error? I would appreciate it. Thanks in advance.


Replies are listed 'Best First'.
Re: Regex KungFu help needed by kennethk (Abbot) on Oct 02, 2009 at 14:49 UTC
Rather than using such advanced approaches, you can allow overlapping regular expressions using Look Around Assertions. Specifically, match on the first letter, and require it be followed by the others of interest. You can then use standard techniques for counting matches.

#!/usr/bin/perl
use strict;
use warnings;

my @real_count = (0,0,0,0);
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
my @pattern;
$pattern[0] = "A(?=AAAA)";
$pattern[1] = "G(?=GGGG)";
$pattern[2] = "G(?=GAGA)";
$pattern[3] = "G(?=AAGG)";

foreach my $i (0..$#pattern) {
    $real_count[$i]++ while ($sequence =~ /$pattern[$i]/g);
}

foreach (@real_count) {
    print "$_\n";
}

Note I also swapped your error prone for loop for a foreach loop with the range operator.
Re^2: Regex KungFu help needed by johngg (Canon) on Oct 02, 2009 at 15:12 UTC
You can put the whole term in the look-ahead to make things a bit simpler and you could take advantage of the `$scalar = () = $string =~ m{$pattern}g;` idiom rather than successive incrementing, wrapping the whole thing in a map.

$ perl -Mstrict -wle '
> my $seq = q{GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA};
> my @pats = qw{ AAAAA GGGGG GGAGA GAAGG };
> my @cts = map {
>     my $re = qr{(?=\Q$_\E)};
>     my $ct = () = $seq =~ m{$re}g;
> } @pats;
> print qq{@cts};'
11 3 1 1
$

I hope this is of interest.

Cheers,

JohnGG
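To see why the `= () =` idiom counts matches, here is a minimal sketch on a made-up string (`banana` and the variable names are illustrative, not from the thread). A global match in list context returns one element per match, and assigning that list to `()` in scalar context yields the number of elements on the right-hand side.

```perl
use strict;
use warnings;

my $text = "banana";    # illustrative input, not from the thread

# Consuming match: the second "ana" overlaps the first, so it is skipped.
my $plain = () = $text =~ /ana/g;

# Zero-width look-ahead: nothing is consumed, so every start position counts.
my $overlapping = () = $text =~ /(?=ana)/g;

print "plain: $plain, overlapping: $overlapping\n";    # plain: 1, overlapping: 2
```

The look-ahead version never consumes characters, which is what lets it find matches that share letters.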
Re^3: Regex KungFu help needed by AnomalousMonk (Archbishop) on Oct 02, 2009 at 22:59 UTC
As a further step, associating patterns with their counts and (cached) regex objects in a hash may be worthwhile:

>perl -wMstrict -le
"my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA';
 my %patterns =
   map { $_ => { count => 0, regex => qr{ (?= \Q$_\E) }xms } }
   qw(AAAAA GGGGG GGAGA GAAGG)
   ;
 $patterns{$_}{count} =()= $sequence =~ m{ $patterns{$_}{regex} }xmsg
   for keys %patterns;
 print qq{$_: $patterns{$_}{count}} for sort keys %patterns;
"
AAAAA: 11
GAAGG: 1
GGAGA: 1
GGGGG: 3

or

>perl -wMstrict -le
"my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA';
 my %patterns =
   map { $_ => { count => 0, regex => qr{ (?= \Q$_\E) }xms } }
   qw(AAAAA GGGGG GGAGA GAAGG)
   ;
 $_->{count} =()= $sequence =~ m{ $_->{regex} }xmsg
   for values %patterns;
 print qq{$_: $patterns{$_}{count}} for sort keys %patterns;
"
AAAAA: 11
GAAGG: 1
GGAGA: 1
GGGGG: 3
Re^4: Regex KungFu help needed by johngg (Canon) on Oct 03, 2009 at 11:03 UTC
Re^5: Regex KungFu help needed by grizzley (Chaplain) on Oct 05, 2009 at 07:17 UTC
Re^2: Regex KungFu help needed by Anonymous Monk on Oct 02, 2009 at 18:25 UTC
Huh, this is interesting, I will play around with this a bit more. Of course, my @patterns are generated on the fly, but a little substitution regex would get me what you have done with the (?=...) inside the patterns. Thanks for showing me something totally new.
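One way such a substitution might look is sketched below. The helper name `to_lookahead` is hypothetical, and the sketch assumes the generated patterns are plain, non-empty strings that should be matched literally (hence the `\Q...\E` quoting).

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite a plain string such as "AAAAA" into the
# "first character, then look-ahead for the rest" form "A(?=AAAA)".
sub to_lookahead {
    my ($plain) = @_;
    my $pattern = $plain;
    $pattern =~ s/\A(.)(.+)\z/\Q$1\E(?=\Q$2\E)/s
        or $pattern = quotemeta $plain;    # single character: nothing to look ahead for
    return $pattern;
}

my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
for my $plain (qw(AAAAA GGGGG GGAGA GAAGG)) {
    my $re    = to_lookahead($plain);
    my $count = () = $sequence =~ /$re/g;
    print "$plain -> $re: $count\n";
}
```

Putting the whole string inside the look-ahead, as in `(?=\Q$plain\E)`, avoids the rewriting step altogether.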
Re: Regex KungFu help needed by moritz (Cardinal) on Oct 02, 2009 at 14:27 UTC
"Can anyone help point me to how to address this error?"

Yes. Read the error message. Read it again. It tells you how to deal with the error. Also consider reading the documentation of the re package.

Perl 6 - links to (nearly) everything that is Perl 6.
Re^2: Regex KungFu help needed by ikegami (Patriarch) on Oct 02, 2009 at 15:02 UTC
Rather poorly, though. It's far from obvious "`re 'eval'`" is literal code, and that it should be preceded by an unstated "`use`". What the message means is that you should use:

use re 'eval';

See re.
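Because `use re 'eval'` is a lexically scoped pragma, it can be confined to the one block that needs it. The sketch below is only an illustration: the helper `count_with_eval_group` and the package variable `$eval_hits` are hypothetical names, not from the thread. As far as I know, Perl 5.18 and later no longer require the pragma when the code block is written literally in the pattern and only the surrounding text is interpolated, but enabling it there should be harmless.

```perl
use strict;
use warnings;

# Package variable (hypothetical name) incremented by the regex code block.
our $eval_hits;

# Hypothetical helper: count overlapping occurrences of a literal string
# with the (?{...})(?!) trick from the original question.
sub count_with_eval_group {
    my ($sequence, $literal) = @_;
    local $eval_hits = 0;
    {
        # The pragma only applies inside this block.
        use re 'eval';
        # (?!) always fails, forcing the engine to retry at every position;
        # the code block runs each time the literal has just matched.
        $sequence =~ /\Q$literal\E(?{ $eval_hits++ })(?!)/;
    }
    return $eval_hits;
}

my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
print "$_: ", count_with_eval_group($sequence, $_), "\n"
    for qw(AAAAA GGGGG GGAGA GAAGG);
```

Keeping the pragma inside the helper means interpolated strings elsewhere in the program still cannot smuggle code blocks into a regex.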
Re^3: Regex KungFu help needed by moritz (Cardinal) on Oct 02, 2009 at 15:19 UTC
I agree, it's poor. Which is why I just submitted a patch that hopefully improves the error message.
Re^4: Regex KungFu help needed by Anonymous Monk on Oct 02, 2009 at 17:18 UTC
Re: Regex KungFu help needed by ELISHEVA (Prior) on Oct 02, 2009 at 14:57 UTC
To insert a literal string into a pattern you can use `\Q` and `\E`. This ensures that any characters that have special meaning in a regex will be escaped and treated as literals.

As for finding all overlapping patterns, see the documentation of `pos` and the zero-length lookahead operator in perlre. Using `pos` with zero-length lookahead regexes (look for `(?=`) might be preferable, since the documentation says that `(?{...})` is experimental.

use strict;
use warnings;

my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";

sub tryPattern {
    my $pattern = shift;
    my $iCount = 0;
    while ($sequence =~ /(?=\Q$pattern\E)/g) {
        $iCount++;
        my $iPos = pos($sequence);
        my $sMatch = substr($sequence, $iPos, 5);
        print "$iPos: $sMatch\n";
        pos($sequence) = $iPos + 1;
    }
    print "total found: $iCount\n";
}

tryPattern($_) foreach qw(AAAAA GGGGG GGAGA GAAGG);

Best, beth
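A small sketch of what `\Q...\E` changes, using a made-up input string (none of these values come from the thread): without quoting, the dot in the interpolated string is a regex wildcard; with quoting, it only matches a literal dot.

```perl
use strict;
use warnings;

my $text    = "a.b axb a.b";    # illustrative input
my $literal = "a.b";            # the dot is meant literally

my $unquoted = () = $text =~ /$literal/g;        # "." matches any character
my $quoted   = () = $text =~ /\Q$literal\E/g;    # "." matches only a dot

print "unquoted: $unquoted\n";    # unquoted: 3
print "quoted: $quoted\n";        # quoted: 2
```

For the nucleotide patterns in this thread the two forms behave the same, but quoting matters as soon as a generated pattern can contain characters such as `.`, `+`, or `(`.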
Re: Regex KungFu help needed by BioLion (Curate) on Oct 02, 2009 at 14:51 UTC
As moritz pointed out, `use re 'eval'` will solve your problem:

use warnings;
use strict;
use re 'eval';

my @real_count = (0,0,0,0);
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
my @pattern;
$pattern[0] = "AAAAA";
$pattern[1] = "GGGGG";
$pattern[2] = "GGAGA";
$pattern[3] = "GAAGG";

for (my $i=0; $i <= 3; $i++) {
    $sequence =~ /$pattern[$i](?{$real_count[$i]++})(?!)/;
}

foreach (@real_count) {
    print "$_\n";
    ## prints
    ## 11
    ## 3
    ## 1
    ## 1
}

It is totally personal preference, but I think I prefer modifying pos for finding overlapping matches:

use warnings;
use strict;

my @real_count = (0,0,0,0,);
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
my @patterns = qw/AAAAA GGGGG GGAGA GAAGG/;

for my $i ( 0 .. $#patterns ) {
    while ( $sequence =~ m/$patterns[$i]/g ){
        $real_count[$i]++;
        ## reset start position for next global match search
        pos($sequence) -= (length$patterns[$i]) -1;
    }
}

foreach (@real_count) {
    print "$_\n";
    ## prints
    ## 11
    ## 3
    ## 1
    ## 1
}

I guess this is mainly a maintainability thing, because being a regex whizz is one thing, but gods help whoever has to maintain the code after you! If you are worried about which is faster (I guess you are not just matching 5-base patterns against 30 or so nucleotides), then there is a lot of info in the monastery about Benchmarking.

Just a something something...
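For the speed question, a minimal sketch with the core Benchmark module could look like this. The sub names and the 100-fold repetition of the sequence are assumptions made for illustration, and the timings will vary by machine and Perl version, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A longer input than the thread's example, built by repetition (an assumption).
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA" x 100;
my $pattern  = "AAAAA";

# Zero-width look-ahead: nothing is consumed, so overlapping hits are counted.
sub count_lookahead {
    my $count = () = $sequence =~ /(?=\Q$pattern\E)/g;
    return $count;
}

# Rewind pos() after each match so the next search starts one character later.
sub count_pos_rewind {
    my $count = 0;
    while ($sequence =~ /\Q$pattern\E/g) {
        $count++;
        pos($sequence) -= length($pattern) - 1;
    }
    return $count;
}

# Sanity check before timing: both approaches must agree.
die "counts differ\n" unless count_lookahead() == count_pos_rewind();

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lookahead  => \&count_lookahead,
    pos_rewind => \&count_pos_rewind,
});
```

Checking that both subs return the same count before timing them guards against benchmarking two things that do not actually do the same job.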