Understanding a portion on the Perlretut

BlueStarry has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've been studying perlretut for a week, and I'm confused by one example:

        while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
            print "Got a TGA stop codon at position ", pos $dna, "\n";
        }

which prints

        Got a TGA stop codon at position 18
        Got a TGA stop codon at position 23

On the regexp: /(\w\w\w)*?TGA/g

Can anybody explain to me, step by step, what the regexp engine does with the provided string?

$dna = "ATCGTTGAATGCAAATGACATGAC"

EDIT:

I had missed the fact that (\w\w\w)*? can match the empty string too, so a TGA right at the position where matching starts is counted as well. However, I still don't understand why this misbehaves without \G.

Re: Understanding a portion on the Perlretut
by Corion (Patriarch) on Dec 09, 2015 at 12:06 UTC

perlretut also has prose text to go with the code. This also motivates why it uses (\w\w\w)*?, namely to progress through the string in triplets instead of trying to match at each character position.


Re^2: Understanding a portion on the Perlretut

by BlueStarry (Novice) on Dec 09, 2015 at 13:26 UTC

There is no such sentence in the explanation.


Re^3: Understanding a portion on the Perlretut

by Corion (Patriarch) on Dec 09, 2015 at 13:33 UTC

I linked to perlretut. Going there, I find:

The naive regexp
    ...
doesn't work; it may match a TGA, but there is no guarantee that the match is aligned with codon boundaries, e.g., the substring GTT GAA gives a match. A better solution is
    while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }
which prints
    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23
Position 18 is good, but position 23 is bogus. What happened?

Maybe it was too obvious for me, but a Codon is a nucleotide triplet.


Re: Understanding a portion on the Perlretut
by Eily (Monsignor) on Dec 09, 2015 at 13:28 UTC

Well of course I agree with Discipulus, because I do love debuggex :)

Here is something that you can try to understand what happens:

use v5.14;
say "With the ?";
'AATCGTTGAATGCAATGACATGAC' =~ /
    (\w\w\w)*?
    (?=(?{say "Checking if <$&> is followed by TGA"})) # Print everything that matched before that point
    TGA/x;
say "Match: $&";

say "\nWithout the ?";
'AATCGTTGAATGCAATGACATGAC' =~ /
    (\w\w\w)*
    (?=(?{say "Checking if <$&> is followed by TGA"})) # Print everything that matched before that point
    TGA/x;
say "Match: $&";

In both cases, the (\w\w\w) group is a loop that reads three characters at a time. The difference is that in the first case (*?), the rest of the regex gets to test the string before each iteration, starting with zero triplets read: it checks whether TGA follows, and if that test fails, three more characters are read and the test is run again. The (\w\w\w)* loop of the second regex, though, keeps reading as long as there are three characters to read, and only then lets the rest of the regex be checked; if the test fails, it goes back (backtracks) one iteration and tries again.

The /g simply remembers the position where the last successful match ended, and starts reading from there on the next attempt.
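
To make the difference concrete, here is a minimal sketch (not part of the original reply) that records where each variant stops on the same string as above. $+[0] is the end offset of the last successful match; the variable names are only illustrative.

```perl
use strict;
use warnings;

my $dna = 'AATCGTTGAATGCAATGACATGAC';

# Lazy: stops at the first TGA found while stepping in triplets.
my $lazy_end   = $dna =~ /(\w\w\w)*?TGA/ ? $+[0] : -1;
# Greedy: reads all triplets first, then backtracks to the last aligned TGA.
my $greedy_end = $dna =~ /(\w\w\w)*TGA/  ? $+[0] : -1;
print "lazy match ends at $lazy_end, greedy match ends at $greedy_end\n";

# With /g, each new attempt starts where the previous match ended.
my @positions;
while ($dna =~ /(\w\w\w)*?TGA/g) {
    push @positions, pos $dna;
}
print "/g positions: @positions\n";
```

On this string the lazy match ends at 9 and the greedy one at 18, and the /g loop reports 9, 18 and 23. The last of these is the misaligned match discussed elsewhere in this thread.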


Re: Understanding a portion on the Perlretut
by Discipulus (Canon) on Dec 09, 2015 at 12:13 UTC

Davido's precious regex tester

~~WRONG match 1,2 or 3 times \w (word char) prefering the minimum amount followed by TGA and capture the result in $1.~~

There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.


Re: Understanding a portion of perlretut
by Athanasius (Archbishop) on Dec 09, 2015 at 14:22 UTC

I think the documentation is a little misleading here. At least, it gives me the impression that the first match (if any) is somehow guaranteed to be valid (because codon-aligned). But that’s true only if, as in the example given, the $dna string happens to contain a valid match somewhere — in which case, it will be found first. But if it doesn’t, the first match is an invalid one:

#! perl
use strict;
use warnings;

while (my $dna = <DATA>)
{
    chomp $dna;
    print "\n\$dna = '$dna'\n";

    while ($dna =~ /(\w\w\w)*?TGA/g)
    {
        print 'Got a TGA stop codon at position ', pos $dna, 
              ', immediately following [', $1, "]\n";
    }
}

__DATA__
ATCGTTGAA
ATCGTTGAATGCAAATGACATGAC

Output:

 0:10 >perl 1476_SoPW.pl

$dna = 'ATCGTTGAA'
Got a TGA stop codon at position 8, immediately following [CGT]

$dna = 'ATCGTTGAATGCAAATGACATGAC'
Got a TGA stop codon at position 18, immediately following [AAA]
Use of uninitialized value $1 in print at 1476_SoPW.pl line 43, <DATA> line 2.
Got a TGA stop codon at position 23, immediately following []

 0:10 >

Adding a \G anchor to the regex:

while ($dna =~ /\G(\w\w\w)*?TGA/g)

fixes the results for both dna strings, because \G means Match only at pos() (e.g. at the end-of-match position of prior m//g) (see “Assertions” in perlre), and initially pos() is set at zero.
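
A small sketch of that comparison, assuming a hypothetical helper stop_positions() (my own name, not from perlretut) that collects pos() after every /g match. It runs the regex with and without \G on the two strings from the __DATA__ section above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from perlretut): return pos() after every /g match.
sub stop_positions {
    my ($dna, $regex) = @_;
    my @found;
    while ($dna =~ /$regex/g) {
        push @found, pos $dna;
    }
    return @found;
}

for my $dna (qw( ATCGTTGAA ATCGTTGAATGCAAATGACATGAC )) {
    my @plain    = stop_positions($dna, qr/(\w\w\w)*?TGA/);
    my @anchored = stop_positions($dna, qr/\G(\w\w\w)*?TGA/);
    print "$dna\n";
    print "  without \\G: @plain\n";
    print "  with \\G:    @anchored\n";
}
```

Without \G the first string yields 8 and the second 18 and 23. With \G the first string yields nothing and the second only 18.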

<Begin update> choroba is of course correct: anchoring to the start of the string finds only the first match, so the suggestion below does not work.

But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:

while ($dna =~ /^(\w\w\w)*?TGA/g)

<End update>

Perhaps not Perl documentation’s finest hour. :-)

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Understanding a portion of perlretut

by choroba (Cardinal) on Dec 09, 2015 at 16:57 UTC

But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:
while ($dna =~ /^(\w\w\w)*?TGA/g)

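^ is not the same as \G. With /g, ^ still matches only at the start of the string, so it finds just the first match; \G matches at pos(), where the previous match ended: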

#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my $dna = join q(), qw( TGA TGA ATG AGA );
for my $regex (qr/(\w\w\w)*?TGA/,
               qr/^(\w\w\w)*?TGA/,
               qr/\G(\w\w\w)*?TGA/,
              ) {
    while ($dna =~ /$regex/g) {
        say "TGA with $regex: ", pos $dna;
    }
}

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,


Re^2: Understanding a portion of perlretut

by Corion (Patriarch) on Dec 09, 2015 at 15:19 UTC

I thought the documentation said that:

... which prints
    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23
Position 18 is good, but position 23 is bogus. What happened?
The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized TGA and start stepping ahead one character position at a time, not what we want. The solution is to use \G to anchor the match to the codon alignment: ...

If there is no match at all, I would assume that we are always past the last real match.
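
One way to see that stepping, as a sketch only, using the string from the original question: for each match, print where the match attempt began ($-[0]) and the offset of the matched TGA modulo 3. A remainder of 0 means the TGA sits on a codon boundary counted from the start of the string. The @report array is just an illustrative name.

```perl
use strict;
use warnings;

my $dna = 'ATCGTTGAATGCAAATGACATGAC';

my @report;
while ($dna =~ /(\w\w\w)*?TGA/g) {
    my $tga_at = pos($dna) - 3;   # offset where the matched TGA begins
    my $frame  = $tga_at % 3;     # 0 means aligned with codons counted from offset 0
    push @report, [ $-[0], $tga_at, $frame ];
    printf "match starts at %d, TGA at %d, frame %d (%s)\n",
        $-[0], $tga_at, $frame, $frame == 0 ? 'aligned' : 'bogus';
}
```

The first match starts at offset 0 and finds an aligned TGA at 15. The second starts at offset 20, two characters past where the previous match ended, which is the engine stepping ahead one character at a time after the aligned attempt from 18 failed.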


Re^3: Understanding a portion of perlretut

by BlueStarry (Novice) on Dec 09, 2015 at 15:52 UTC

(\w\w\w)*?TGA

ATCGTTGAA


Re^4: Understanding a portion of perlretut

by Corion (Patriarch) on Dec 09, 2015 at 15:55 UTC

Re^5: Understanding a portion of perlretut

by BlueStarry (Novice) on Dec 09, 2015 at 16:15 UTC


Re^2: Understanding a portion of perlretut

by BlueStarry (Novice) on Dec 09, 2015 at 15:16 UTC

I see! So I'm not the only one! Cheers Athanasius, and thank you very much.


Re: Understanding a portion on the Perlretut
by Anonymous Monk on Dec 09, 2015 at 19:34 UTC

Can anybody explain to me, step by step, what the regexp engine does with the provided string?

algorithm

Unfortunately, explaining it step by step just takes too long - note, it's not difficult to explain or to understand - it just takes too much typing, since there are a lot of repetitive steps. Maybe someone'll do it anyway? Or maybe not...
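
Since typing the steps out by hand is the tedious part, here is a rough sketch that lets Perl do the typing. It is a hand-written model of how a backtracking engine would try /(\w\w\w)*?TGA/g, not the real engine (which may skip steps through optimizations), and trace_match() is a hypothetical helper name. For the real engine's own trace, the re pragma's debug mode (use re 'debug';) is the thing to look at.

```perl
use strict;
use warnings;

# Hypothetical helper: models one attempt of /(\w\w\w)*?TGA/g with the search
# beginning at offset $from. Returns the end offset of the match, or -1.
sub trace_match {
    my ($dna, $from, $log) = @_;
    for my $start ($from .. length($dna) - 3) {
        my $at = $start;
        while (1) {
            my $next = substr($dna, $at, 3);
            push @$log, "start $start: is '$next' at offset $at TGA?";
            return $at + 3 if $next eq 'TGA';   # lazy: stop at the first hit
            last unless $next =~ /\A\w\w\w\z/;  # cannot read another triplet
            $at += 3;                           # consume one more (\w\w\w)
        }
    }
    return -1;
}

my $dna = 'ATCGTTGAATGCAAATGACATGAC';
my (@log, @ends);
my $pos = 0;
while ((my $end = trace_match($dna, $pos, \@log)) >= 0) {
    push @ends, $end;
    $pos = $end;    # like /g: the next attempt starts where this match ended
}
print "$_\n" for @log;
print "match ends: @ends\n";
```

The trace shows the first attempt walking ATC, GTT, GAA, TGC, AAA and then hitting TGA (end 18). The second attempt fails from offsets 18 and 19 and then finds TGA directly at offset 20 (end 23).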
