http://qs1969.pair.com?node_id=1149753

BlueStarry has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have been studying perlretut for a week, and I'm confused by one example:

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

On the regexp:   /(\w\w\w)*?TGA/g

Can anybody explain to me, step by step, what the regexp engine does with the provided string?

 $dna = "ATCGTTGAATGCAAATGACATGAC"

EDIT:

I was missing the fact that *? matches the empty string too, so a TGA right at the start of the string is counted as well. However, I still don't understand why this is buggy without \G.

Re: Understanding a portion on the Perlretut
by Corion (Patriarch) on Dec 09, 2015 at 12:06 UTC

    perlretut also has prose text to go with the code. That text also explains why it uses (\w\w\w)*?, namely to progress through the string in triplets instead of trying to match at each character position.

      There is no such sentence in the explanation.

        I linked to perlretut. Going there, I find:

        The naive regexp

        ...

        doesn't work; it may match a TGA, but there is no guarantee that the match is aligned with codon boundaries, e.g., the substring GTT GAA gives a match. A better solution is

            while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
                print "Got a TGA stop codon at position ", pos $dna, "\n";
            }

        which prints

            Got a TGA stop codon at position 18
            Got a TGA stop codon at position 23

        Position 18 is good, but position 23 is bogus. What happened?

        Maybe it was too obvious for me, but a codon is a nucleotide triplet.
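
To make the alignment point concrete, here is a minimal sketch that lists every literal TGA in the sample string and marks whether it starts on a codon boundary. The plain /TGA/g search and the helper name tga_offsets are assumptions made for illustration; they are not necessarily the regexp elided above.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start_offset, is_codon_aligned] for every
# literal TGA found in $dna by a plain, unanchored search.
sub tga_offsets {
    my ($dna) = @_;
    my @hits;
    while ($dna =~ /TGA/g) {
        my $start = $-[0];    # offset where this match began
        push @hits, [ $start, $start % 3 == 0 ? 1 : 0 ];
    }
    return @hits;
}

for my $hit (tga_offsets('ATCGTTGAATGCAAATGACATGAC')) {
    printf "TGA at offset %2d: %s\n", $hit->[0],
        $hit->[1] ? 'codon-aligned' : 'not aligned';
}
```

Only the TGA starting at offset 15 sits on a triplet boundary; it ends at position 18, the "good" position in the quoted output. The TGA starting at offset 20 ends at position 23, the "bogus" one.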

Re: Understanding a portion on the Perlretut
by Eily (Monsignor) on Dec 09, 2015 at 13:28 UTC

    Well of course I agree with Discipulus, because I do love debuggex :)

    Here is something that you can try to understand what happens:

    use v5.14;

    say "With the ?";
    'AATCGTTGAATGCAATGACATGAC' =~ /
        (\w\w\w)*?
        (?=(?{say "Checking if <$&> is followed by TGA"})) # Print everything that matched before that point
        TGA/x;
    say "Match: $&";

    say "\nWithout the ?";
    'AATCGTTGAATGCAATGACATGAC' =~ /
        (\w\w\w)*
        (?=(?{say "Checking if <$&> is followed by TGA"})) # Print everything that matched before that point
        TGA/x;
    say "Match: $&";

    You don't have to understand how the second line of the regex works, it just prints debug information on the current state of the regex :). Do note that I have changed your sample input so that there are two different "TGA" at a multiple of three position.

    In both cases, (\w\w\w) is a loop that reads three characters at a time. The difference is that in the first case (with *?), the loop lets the last part of the regex test the string (check whether TGA comes next) before reading anything and again after each group of three characters; if the test fails, three new characters are read and the test is run again. The (\w\w\w)* loop of the second regex, though, keeps reading for as long as there are three characters to read, and only lets the last part of the regex be checked once it is done; if the test fails, it goes back (backtracks) one iteration and tries again.

    The /g simply remembers the position where the last successful match ended, and starts reading from there on the next attempt.
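
A small sketch of the lazy versus greedy difference on the modified sample string, without the debugging code block. The helper name matched_text is made up for this illustration.

```perl
use strict;
use warnings;

my $sample = 'AATCGTTGAATGCAATGACATGAC';   # the modified sample from this reply

# Hypothetical helper: return the text matched by $re, or undef on failure.
sub matched_text {
    my ($string, $re) = @_;
    return $string =~ $re ? $& : undef;
}

my $lazy   = matched_text($sample, qr/(\w\w\w)*?TGA/);
my $greedy = matched_text($sample, qr/(\w\w\w)*TGA/);

print "lazy:   $lazy\n";     # as few triplets as possible before TGA
print "greedy: $greedy\n";   # as many triplets as possible before TGA
```

The lazy version should stop at the first aligned TGA (matching AATCGTTGA), while the greedy version reads to the end of the string and backs off one triplet at a time until it reaches the last aligned TGA (matching AATCGTTGAATGCAATGA).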

Re: Understanding a portion on the Perlretut
by Discipulus (Canon) on Dec 09, 2015 at 12:13 UTC
    Did you say step-by-step? Are you sure?... ;=)
    Davido's precious regex tester; another testing site suggested by Eily; also, anonymously suggested, rxrx, and this one and this other one.

    Without help I read it something like this: [WRONG: match \w (word char) 1, 2 or 3 times, preferring the minimum amount, followed by TGA, and capture the result in $1.] Capture in $1 the last triplet formed of \w followed by TGA.

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Understanding a portion of perlretut
by Athanasius (Archbishop) on Dec 09, 2015 at 14:22 UTC

    I think the documentation is a little misleading here. At least, it gives me the impression that the first match (if any) is somehow guaranteed to be valid (because codon-aligned). But that’s true only if, as in the example given, the $dna string happens to contain a valid match somewhere — in which case, it will be found first. But if it doesn’t, the first match is an invalid one:

        #! perl
        use strict;
        use warnings;

        while (my $dna = <DATA>) {
            chomp $dna;
            print "\n\$dna = '$dna'\n";
            while ($dna =~ /(\w\w\w)*?TGA/g) {
                print 'Got a TGA stop codon at position ', pos $dna,
                      ', immediately following [', $1, "]\n";
            }
        }

        __DATA__
        ATCGTTGAA
        ATCGTTGAATGCAAATGACATGAC

    Output:

        0:10 >perl 1476_SoPW.pl

        $dna = 'ATCGTTGAA'
        Got a TGA stop codon at position 8, immediately following [CGT]

        $dna = 'ATCGTTGAATGCAAATGACATGAC'
        Got a TGA stop codon at position 18, immediately following [AAA]
        Use of uninitialized value $1 in print at 1476_SoPW.pl line 43, <DATA> line 2.
        Got a TGA stop codon at position 23, immediately following []

        0:10 >

    Adding a \G anchor to the regex:

    while ($dna =~ /\G(\w\w\w)*?TGA/g)

    fixes the results for both DNA strings, because \G means “Match only at pos() (e.g. at the end-of-match position of prior m//g)” (see “Assertions” in perlre), and initially pos() is set at zero.
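
As a quick check of that claim, the sketch below collects pos() after each successful match, with and without \G, for the two strings used in the test above. The helper name stop_positions is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect pos() after each successful /g match of $re.
sub stop_positions {
    my ($dna, $re) = @_;
    my @positions;
    while ($dna =~ /$re/g) {
        push @positions, pos $dna;
    }
    return @positions;
}

for my $dna ('ATCGTTGAA', 'ATCGTTGAATGCAAATGACATGAC') {
    my @plain    = stop_positions($dna, qr/(\w\w\w)*?TGA/);
    my @anchored = stop_positions($dna, qr/\G(\w\w\w)*?TGA/);
    print "$dna\n";
    print "  without \\G: @plain\n";
    print "  with \\G:    @anchored\n";
}
```

Without \G the short string reports position 8 and the long one reports 18 and 23. With \G the short string should report nothing, and the long one only 18, because a failed attempt at pos() is no longer retried one character further along.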

    <Begin update> choroba is of course correct: anchoring to the start of the string finds only the first match, so please disregard the following suggestion:

    But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:

    while ($dna =~ /^(\w\w\w)*?TGA/g)

    <End update>

    Perhaps not Perl documentation’s finest hour. :-)

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:
      while ($dna =~ /^(\w\w\w)*?TGA/g)
      With ^ instead of \G, the regex would match only once, even with the /g modifier, because after a successful match the next match starts where the previous one ended, and ^ can't match there. With \G, though, you can get all the matches from the loop.

          #! /usr/bin/perl
          use warnings;
          use strict;
          use feature qw{ say };

          my $dna = join q(), qw( TGA TGA ATG AGA );

          for my $regex (qr/(\w\w\w)*?TGA/,
                         qr/^(\w\w\w)*?TGA/,
                         qr/\G(\w\w\w)*?TGA/,
                        ) {
              while ($dna =~ /$regex/g) {
                  say "TGA with $regex: ", pos $dna;
              }
          }

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      I thought the documentation said that:

      ... which prints
      Got a TGA stop codon at position 18
      Got a TGA stop codon at position 23

      Position 18 is good, but position 23 is bogus. What happened?

      The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized TGA and start stepping ahead one character position at a time, not what we want. The solution is to use \G to anchor the match to the codon alignment: ...

      If there is no match at all, I would assume that we are always past the last real match.

        By the way, I still don't understand where the regexp backtracks to when it fails and tests again, with (\w\w\w)*?TGA against

        ATCGTTGAA

        Step 1: match the leftmost part 0 times, OK; no TGA after it, therefore no match.
        Step 2: we start from?
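
One way to picture the steps is to imitate the attempts by hand in plain Perl. The sketch below is only a simulation under simplifying assumptions: it is not how the regex engine is implemented (the real engine applies optimizations), and the helper name trace_lazy_tga is made up. At each start offset it tries zero triplets, then one, then two, and so on, checking for TGA each time; when a start offset is exhausted, it moves the start one character to the right.

```perl
use strict;
use warnings;

# Hypothetical helper: imitate the attempts made for /(\w\w\w)*?TGA/
# beginning at offset $from. Returns the end offset of the match (or undef)
# and a reference to a log with one entry per start offset tried.
sub trace_lazy_tga {
    my ($dna, $from) = @_;
    my @log;
    for my $start ($from .. length($dna)) {
        my $p = $start;    # zero triplets consumed so far
        while (1) {
            if (substr($dna, $p, 3) eq 'TGA') {
                push @log, "start $start: TGA at offset $p, match ends at " . ($p + 3);
                return ($p + 3, \@log);
            }
            # No TGA here: the lazy group must take one more triplet, if it can.
            last unless substr($dna, $p, 3) =~ /\A\w\w\w\z/;
            $p += 3;
        }
        push @log, "start $start: ran out of triplets without seeing TGA";
    }
    return (undef, \@log);
}

my ($end, $log) = trace_lazy_tga('ATCGTTGAA', 0);
print "$_\n" for @$log;
```

For ATCGTTGAA, start offsets 0 and 1 run out of triplets, and start offset 2 finds TGA after one triplet (CGT), which ends the match at position 8. So "Step 2" starts from offset 1: once every triplet count has failed at one start offset, the whole pattern is retried one character further along, and that is what breaks the codon alignment.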
      I see! So I'm not the only one! Cheers Athanasius, and thank you very much.
Re: Understanding a portion on the Perlretut
by Anonymous Monk on Dec 09, 2015 at 19:34 UTC
    Can anybody explain to me, step by step, what the regexp engine does with the provided string?
    Just wanted to say that this is a good question, and, indeed, understanding the algorithm of something is the best way to learn it (IMO).

    Unfortunately, explaining it step by step just takes too long - note, it's not difficult to explain or to understand - it just takes too much typing, since there are a lot of repetitive steps. Maybe someone'll do it anyway? Or maybe not...

    So instead I recommend reading "Mastering Regular Expressions" by J. Friedl. It has very detailed explanations and, AFAIK, it's still the best book about regexes. Most of the examples are in Perl. It's not a big book (there are several appendixes which you can simply skip).