
Re: Understanding a portion of perlretut

by Athanasius (Archbishop)
on Dec 09, 2015 at 14:22 UTC ( [id://1149769]=note )


in reply to Understanding a portion on the Perlretut

I think the documentation is a little misleading here. At least, it gives me the impression that the first match (if any) is somehow guaranteed to be valid (because codon-aligned). But that’s true only if, as in the example given, the $dna string happens to contain a valid match somewhere — in which case, it will be found first. But if it doesn’t, the first match is an invalid one:

    #! perl
    use strict;
    use warnings;

    while (my $dna = <DATA>)
    {
        chomp $dna;
        print "\n\$dna = '$dna'\n";

        while ($dna =~ /(\w\w\w)*?TGA/g)
        {
            print 'Got a TGA stop codon at position ', pos $dna,
                  ', immediately following [', $1, "]\n";
        }
    }

    __DATA__
    ATCGTTGAA
    ATCGTTGAATGCAAATGACATGAC

Output:

    0:10 >perl 1476_SoPW.pl

    $dna = 'ATCGTTGAA'
    Got a TGA stop codon at position 8, immediately following [CGT]

    $dna = 'ATCGTTGAATGCAAATGACATGAC'
    Got a TGA stop codon at position 18, immediately following [AAA]
    Use of uninitialized value $1 in print at 1476_SoPW.pl line 43, <DATA> line 2.
    Got a TGA stop codon at position 23, immediately following []

    0:10 >

Adding a \G anchor to the regex:

while ($dna =~ /\G(\w\w\w)*?TGA/g)

fixes the results for both DNA strings, because \G means “Match only at pos() (e.g. at the end-of-match position of prior m//g)” (see “Assertions” in perlre), and pos() is initially zero.
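For anyone who wants to compare the two regexes side by side, here is a minimal sketch. The helper stop_positions is my own name for illustration, not something from perlretut; it just collects pos() after each successful match.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper (not from perlretut): return the value of pos()
# after every successful /g match of $regex against $dna.
sub stop_positions
{
    my ($dna, $regex) = @_;
    my @positions;

    while ($dna =~ /$regex/g)
    {
        push @positions, pos $dna;
    }

    return @positions;
}

for my $dna (qw( ATCGTTGAA ATCGTTGAATGCAAATGACATGAC ))
{
    my @loose   = stop_positions($dna, qr/(\w\w\w)*?TGA/);
    my @aligned = stop_positions($dna, qr/\G(\w\w\w)*?TGA/);

    print "$dna\n";
    print "  without \\G: @loose\n";     # 8 for the first string; 18 23 for the second
    print "  with \\G:    @aligned\n";   # nothing for the first string; 18 for the second
}
```

With \G the first string yields no match at all, and the second yields only the codon-aligned TGA ending at position 18.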

<Begin update> choroba is of course correct, anchoring to the start of the string finds only the first match.

But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:

while ($dna =~ /^(\w\w\w)*?TGA/g)

<End update>

Perhaps not Perl documentation’s finest hour. :-)

Hope that helps,

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^2: Understanding a portion of perlretut
by choroba (Cardinal) on Dec 09, 2015 at 16:57 UTC
    But that means that the regex could also be fixed without recourse to \G, by simply anchoring it to the start of the string:
    while ($dna =~ /^(\w\w\w)*?TGA/g)
    With ^ instead of \G, the regex would match only once, even with the /g modifier, because after a successful match the next match starts where the previous one ended, and ^ can't match there. With \G, though, you can get all the matches from the loop.
        #! /usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        my $dna = join q(), qw( TGA TGA ATG AGA );

        for my $regex (qr/(\w\w\w)*?TGA/,
                       qr/^(\w\w\w)*?TGA/,
                       qr/\G(\w\w\w)*?TGA/,
                      ) {
            while ($dna =~ /$regex/g) {
                say "TGA with $regex: ", pos $dna;
            }
        }
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^2: Understanding a portion of perlretut
by Corion (Patriarch) on Dec 09, 2015 at 15:19 UTC

    I thought the documentation said that:

    ... which prints
        Got a TGA stop codon at position 18
        Got a TGA stop codon at position 23

    Position 18 is good, but position 23 is bogus. What happened?

    The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized TGA and start stepping ahead one character position at a time, not what we want. The solution is to use \G to anchor the match to the codon alignment: ...

    If there is no match at all, I would assume that we are always past the last real match.

      By the way, I still don't understand where the regexp backtracks to when it fails and tries again: (\w\w\w)*?TGA

      ATCGTTGAA

      Step 1: match the leftmost part 0 times: OK, but no TGA follows, therefore no match. Step 2: where do we start from?

        Step 2 is to try the leftmost part 1 time (no TGA found).

        Step 3 is to try the leftmost part 2 times (no TGA found).

        Step 4 (and this is where the naive part goes bad) is to advance the leftmost starting point by one, since the match is unanchored.
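The steps above can be spelled out by hand. The following is only a conceptual sketch of the order of attempts, not how the regex engine is actually implemented (the real engine may apply optimizations). The sub name first_tga is made up for illustration, and it assumes the string contains only word characters.

```perl
#! perl
use strict;
use warnings;

# Hypothetical helper: emulate the order in which an unanchored
# /(\w\w\w)*?TGA/ tries things. Assumes $dna holds only word characters.
# Returns (start offset, codons skipped) for the first success,
# or an empty list if nothing matches.
sub first_tga
{
    my ($dna) = @_;

    # Unanchored: if every attempt at one start offset fails,
    # move the start offset forward by one character.
    for my $start (0 .. length($dna) - 3)
    {
        # Lazy quantifier: try skipping 0 codons first, then 1, then 2, ...
        # as long as three characters remain to compare against 'TGA'.
        for (my $skipped = 0;
             $start + 3 * $skipped + 3 <= length $dna;
             ++$skipped)
        {
            my $candidate = substr $dna, $start + 3 * $skipped, 3;
            print "start $start, $skipped codon(s) skipped: '$candidate'\n";
            return ($start, $skipped) if $candidate eq 'TGA';
        }
    }

    return;
}

my ($start, $skipped) = first_tga('ATCGTTGAA');
print 'Match ends at position ', $start + 3 * $skipped + 3, "\n"
    if defined $start;
```

For ATCGTTGAA the trace tries ATC, GTT and GAA from offset 0, then TCG and TTG from offset 1, and finally succeeds from offset 2 after skipping CGT, which is exactly the misaligned match ending at position 8.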

Re^2: Understanding a portion of perlretut
by BlueStarry (Novice) on Dec 09, 2015 at 15:16 UTC
    I see! So I'm not the only one! Cheers Athanasius, and thank you very much.
