drblove27 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

So in attempting to become a totally awesome regexer, I have run into a problem that I am hoping you can assist me with.

Specifically, I have text patterns and strings in which I want to count those patterns, and I want to allow for overlapping matches within a string. I have hunted around these forums and found some good advice, but I have an additional problem. Here is what I have so far:

my $first_counter = 0;
my $second_counter = 0;
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";

# This is what I initially did, but did not get the right answer
$first_counter++ while $sequence =~ /AAAAA/g;      # Here I get 3

# Because I want to count the overlapping AAAAA in the sequence I hunted
# around and found this solution
$sequence =~ /AAAAA(?{$second_counter++})(?!)/;    # This gives the right answer of 11

print "First counter: $first_counter\n";
print "Second counter: $second_counter\n";

Now, what I really want is to pass the pattern in as a variable instead of hard-coding it in the search, i.e. something like:

my @real_count = (0,0,0,0);
my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
my @pattern;
$pattern[0] = "AAAAA";
$pattern[1] = "GGGGG";
$pattern[2] = "GGAGA";
$pattern[3] = "GAAGG";

for (my $i=0; $i <= 3; $i++) {
    $sequence =~ /$pattern[$i](?{$real_count[$i]++})(?!)/;
}

foreach (@real_count) {
    print "$_\n";
}

When I run the above I get the following error message (which I admit I do not understand):

"Eval-group not allowed at runtime, use re 'eval' in regex m/AAAAA(?{$real_count[$i]++})(?!)/ at E:\Bioreka\Test\multimatch.pl line 13."

Can anyone help point me to how to address this error? I would appreciate it. Thanks in advance.

Re: Regex KungFu help needed
by kennethk (Abbot) on Oct 02, 2009 at 14:49 UTC
    Rather than using such advanced approaches, you can allow overlapping matches using lookaround assertions. Specifically, match on the first letter and require that it be followed by the remaining letters of interest. You can then use standard techniques for counting matches.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @real_count = (0,0,0,0);
    my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
    my @pattern;
    $pattern[0] = "A(?=AAAA)";
    $pattern[1] = "G(?=GGGG)";
    $pattern[2] = "G(?=GAGA)";
    $pattern[3] = "G(?=AAGG)";

    foreach my $i (0..$#pattern) {
        $real_count[$i]++ while ($sequence =~ /$pattern[$i]/g);
    }

    foreach (@real_count) {
        print "$_\n";
    }

    Note that I also swapped your error-prone C-style for loop for a foreach loop with the range operator.

      You can put the whole term in the look-ahead to make things a bit simpler. You could also take advantage of the $scalar = () = $string =~ m{$pattern}g; idiom rather than successive incrementing, and wrap the whole thing in a map.

      $ perl -Mstrict -wle '
      > my $seq = q{GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA};
      > my @pats = qw{ AAAAA GGGGG GGAGA GAAGG };
      > my @cts = map {
      >     my $re = qr{(?=\Q$_\E)};
      >     my $ct = () = $seq =~ m{$re}g;
      > } @pats;
      > print qq{@cts};'
      11 3 1 1
      $

      I hope this is of interest.

      Cheers,

      JohnGG

        As a further step, associating patterns with their counts and (cached) regex objects in a hash may be worthwhile:

        >perl -wMstrict -le
        "my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA';
         my %patterns =
           map { $_ => { count => 0, regex => qr{ (?= \Q$_\E) }xms } }
           qw(AAAAA GGGGG GGAGA GAAGG)
           ;
         $patterns{$_}{count} =()= $sequence =~ m{ $patterns{$_}{regex} }xmsg
           for keys %patterns;
         print qq{$_: $patterns{$_}{count}} for sort keys %patterns;
        "
        AAAAA: 11
        GAAGG: 1
        GGAGA: 1
        GGGGG: 3

        or

        >perl -wMstrict -le
        "my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA';
         my %patterns =
           map { $_ => { count => 0, regex => qr{ (?= \Q$_\E) }xms } }
           qw(AAAAA GGGGG GGAGA GAAGG)
           ;
         $_->{count} =()= $sequence =~ m{ $_->{regex} }xmsg
           for values %patterns;
         print qq{$_: $patterns{$_}{count}} for sort keys %patterns;
        "
        AAAAA: 11
        GAAGG: 1
        GGAGA: 1
        GGGGG: 3

      Huh, this is interesting; I will play around with this a bit more. Of course, my @patterns are generated on the fly, but a little substitution regex would get me what you have done with the (?=...) inside the patterns. Thanks for showing me something totally new.
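
One way such a substitution might look, as a minimal sketch. The helper names to_lookahead and count_overlapping are made up for illustration, and the sketch assumes each pattern is a plain literal string of at least one character:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal such as "AAAAA" into "A(?=AAAA)",
# i.e. the first character followed by a look-ahead for the rest.
# \Q...\E escapes any regex metacharacters in either part.
sub to_lookahead {
    my ($literal) = @_;
    (my $converted = $literal) =~ s/\A(.)(.*)\z/\Q$1\E(?=\Q$2\E)/s;
    return $converted;
}

# Count overlapping occurrences using the converted pattern and the
# "= () =" idiom, which counts the matches found in list context.
sub count_overlapping {
    my ($sequence, $literal) = @_;
    my $regex = to_lookahead($literal);
    my $count = () = $sequence =~ /$regex/g;
    return $count;
}

my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA';
print "$_: ", count_overlapping($sequence, $_), "\n"
    for qw(AAAAA GGGGG GGAGA GAAGG);
```

For the sequence used in this thread, this should report the same counts as the replies above (11, 3, 1 and 1).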
Re: Regex KungFu help needed
by moritz (Cardinal) on Oct 02, 2009 at 14:27 UTC
    Can anyone help point me to how to address this error?

    Yes. Read the error message. Read it again. It tells you how to deal with the error.

    Also consider reading the documentation of the re package.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Rather poorly, though. It's far from obvious "re 'eval'" is literal code, and that it should be preceded by an unstated "use". What the message means is that you should use:
      use re 'eval';

      See the documentation for re.

Re: Regex KungFu help needed
by ELISHEVA (Prior) on Oct 02, 2009 at 14:57 UTC

    To insert a literal string into a pattern you can use \Q and \E. This ensures that any characters that have special meaning in a regex will be escaped and treated as literals.
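
A small illustration of the difference \Q and \E make, as a sketch only. The sample text and the helper name count_matches are invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: count non-overlapping matches of a regex in a string.
sub count_matches {
    my ($text, $regex) = @_;
    my $count = () = $text =~ /$regex/g;
    return $count;
}

my $text    = 'a.c abc a.c';
my $literal = 'a.c';

# Interpolated as-is, the "." is a metacharacter and matches any character,
# so "abc" is counted as well.
my $unquoted = count_matches($text, qr/$literal/);

# Inside \Q...\E the "." is escaped and matches only a literal dot.
my $quoted = count_matches($text, qr/\Q$literal\E/);

print "unquoted: $unquoted\n";   # prints "unquoted: 3"
print "quoted: $quoted\n";       # prints "quoted: 2"
```

For DNA strings made only of letters the two forms behave the same, so the quoting matters mainly when patterns come from outside the program.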

    As for finding all overlapping patterns, see the documentation of pos and of the zero-length lookahead operator in perlre. Using pos together with zero-length lookahead regexes (look for (?=...)) might be preferable, since the documentation says that (?{...}) is experimental.

    use strict;
    use warnings;

    my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";

    sub tryPattern {
        my $pattern = shift;
        my $iCount = 0;
        while ($sequence =~ /(?=\Q$pattern\E)/g) {
            $iCount++;
            my $iPos = pos($sequence);
            my $sMatch = substr($sequence, $iPos, 5);
            print "$iPos: $sMatch\n";
            pos($sequence) = $iPos + 1;
        }
        print "total found: $iCount\n";
    }

    tryPattern($_) foreach qw(AAAAA GGGGG GGAGA GAAGG);

    Best, beth

Re: Regex KungFu help needed
by BioLion (Curate) on Oct 02, 2009 at 14:51 UTC

    As moritz pointed out, use re 'eval' will solve your problem:

    use warnings;
    use strict;
    use re 'eval';

    my @real_count = (0,0,0,0);
    my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
    my @pattern;
    $pattern[0] = "AAAAA";
    $pattern[1] = "GGGGG";
    $pattern[2] = "GGAGA";
    $pattern[3] = "GAAGG";

    for (my $i=0; $i <= 3; $i++) {
        $sequence =~ /$pattern[$i](?{$real_count[$i]++})(?!)/;
    }

    foreach (@real_count) {
        print "$_\n";
        ## prints
        ## 11
        ## 3
        ## 1
        ## 1
    }

    It is totally personal preference, but I think I prefer modifying pos for finding overlapping matches:

    use warnings;
    use strict;

    my @real_count = (0,0,0,0,);
    my $sequence = "GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA";
    my @patterns = qw/AAAAA GGGGG GGAGA GAAGG/;

    for my $i ( 0 .. $#patterns ) {
        while ( $sequence =~ m/$patterns[$i]/g ){
            $real_count[$i]++;
            ## reset start position for next global match search
            pos($sequence) -= (length $patterns[$i]) - 1;
        }
    }

    foreach (@real_count) {
        print "$_\n";
        ## prints
        ## 11
        ## 3
        ## 1
        ## 1
    }

    I guess this is mainly a maintainability thing: being a regex whizz is one thing, but gods help whoever has to maintain the code after you! If you are worried about which is faster (I guess you are not just matching 5-base patterns against 30 or so nucleotides), then there is a lot of info in the monastery about Benchmarking.
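
A rough sketch of how two of the approaches from this thread could be compared with the core Benchmark module. The sub names and the longer test sequence are assumptions made for this example, and timings will vary by machine and Perl version:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: the thread's sequence repeated to make it longer.
my $sequence = 'GGGGGGGAGAAAAAAAAAAAAAAAGAAGGA' x 100;
my $pattern  = 'AAAAA';

# Zero-width look-ahead: every start position is tested, nothing is consumed.
sub count_with_lookahead {
    my ($seq, $pat) = @_;
    my $count = () = $seq =~ /(?=\Q$pat\E)/g;
    return $count;
}

# Ordinary match, then step pos back so the next search starts one
# character after the start of the previous match.
sub count_with_pos {
    my ($seq, $pat) = @_;
    my $count = 0;
    while ($seq =~ /\Q$pat\E/g) {
        $count++;
        pos($seq) -= length($pat) - 1;
    }
    return $count;
}

# A negative count asks Benchmark to run each sub for at least that many
# CPU seconds and then prints a comparison table.
cmpthese(-1, {
    lookahead => sub { count_with_lookahead($sequence, $pattern) },
    pos_reset => sub { count_with_pos($sequence, $pattern) },
});
```

Before trusting any timing, it is worth checking that both subs return the same count for the same input.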

    Just a something something...