imlou has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking at a piece of code and trying to understand it, but I get stuck at this point. Please help and explain what the following line means. Thanks.
while($seq =~/(ATG)(...)*?(?=TAA|TAG|TGA)/og){ $start = length($'); }
I've never run into code like this, though my knowledge of Perl isn't that deep either. Thanks.

Replies are listed 'Best First'.
Re: explanation of code pls
by Enlil (Parson) on Dec 06, 2003 at 21:05 UTC
    The code goes like so:
    while the pattern matches what is in $seq
       assign $start the length of whatever is left in $seq after the current match ($').
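    As a minimal sketch of that loop in action (the input sequence and the @seen array are made up for illustration, not taken from the original code), this records every value $start takes:

```perl
use strict;
use warnings;

# Hypothetical input: two start codons, each followed later by a stop codon.
my $seq = 'ATGAAATAACCATGCCCGGGTGATT';
my $start;
my @seen;    # every value $start takes, in order (added for illustration)

while ($seq =~ /(ATG)(...)*?(?=TAA|TAG|TGA)/g) {
    $start = length($');    # length of the text after the current match
    push @seen, $start;
}

print "values of \$start: @seen\n";
print "final \$start: $start\n";
```

    Since $start is overwritten on each pass, only the value from the last successful match survives the loop; for this input that should be 5, after an earlier value of 19.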

    The explanation of the pattern from japhy's YAPE::Regex::Explain module is as follows:

    The regular expression:

    (?-imsx:/(ATG)(...)*?(?=TAA|TAG|TGA)/og)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      /                        '/'
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        ATG                      'ATG'
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      (                        group and capture to \2 (0 or more times
                               (matching the least amount possible)):
    ----------------------------------------------------------------------
        .                        any character except \n
    ----------------------------------------------------------------------
        .                        any character except \n
    ----------------------------------------------------------------------
        .                        any character except \n
    ----------------------------------------------------------------------
      )*?                      end of \2 (NOTE: because you're using a
                               quantifier on this capture, only the LAST
                               repetition of the captured pattern will be
                               stored in \2)
    ----------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    ----------------------------------------------------------------------
        TAA                      'TAA'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        TAG                      'TAG'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        TGA                      'TGA'
    ----------------------------------------------------------------------
      )                        end of look-ahead
    ----------------------------------------------------------------------
      /og                      '/og'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
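    The NOTE about \2 in the explanation above can be checked directly. The following is only a sketch: the sequence and the variable names ($g1, $g2, $body, @codons) are invented for the example, and the second pattern is a variation on the original rather than something from the posted code.

```perl
use strict;
use warnings;

# Hypothetical sequence: start codon, three ordinary codons, stop codon.
my $seq = 'ATGAAACCCGGGTAA';

my ($g1, $g2);
if ($seq =~ /(ATG)(...)*?(?=TAA|TAG|TGA)/) {
    ($g1, $g2) = ($1, $2);    # $2 holds only the last repetition of (...)
}

# To keep every codon, capture the whole run once and split it afterwards.
my @codons;
if ($seq =~ /ATG((?:...)*?)(?=TAA|TAG|TGA)/) {
    my $body = $1;                  # copy before running another match
    @codons = $body =~ /(...)/g;    # list-context /g returns every capture
}

print "\$1 = $g1, \$2 = $g2\n";
print "all codons: @codons\n";
```

    Here $2 ends up as GGG, the last codon before the stop codon, while the second pattern recovers all three codons.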
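    Note that the '/' and '/og' nodes in the explanation are artifacts of handing the delimiters and modifiers to the module along with the pattern; they are not literal characters that the regex has to match. As for what the /o and /g modifiers do, take a look at perldoc perlop.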

    Anyhow, hope this helps.

    -enlil

Re: explanation of code pls
by graff (Chancellor) on Dec 06, 2003 at 21:05 UTC
    The condition in the while loop will be true for as many times as there are substrings contained in $seq that begin with "ATG", followed by zero or more sets of any three characters, and end with any of "TGA", "TAG" or "TAA". The "*?" bit after the "(...)" indicates that this pattern (any three characters) should be "non-greedy" -- i.e. match as few times as possible before looking for the "(?=TAA|TAG|TGA)".

    For each iteration where this test succeeds, the length of the string that follows the match is assigned to $start -- because the final "TAA or TAG or TGA" part of the match is used within "(?= )", it gets counted/included as part of the string that follows the match. (Look up the $' variable in perlvar, and the "(?=pattern)" construct in perlre.)
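    To see the effect of the "(?= )" on $', here is a small comparison sketch. The sequence and variable names are hypothetical, and the second pattern (with the stop codon in an ordinary non-capturing group) is not from the original code; it is only there for contrast.

```perl
use strict;
use warnings;

# Hypothetical sequence: start codon, one ordinary codon, stop codon, tail.
my $seq = 'ATGAAATAGCC';

my ($after_lookahead, $after_consuming);

# Stop codon inside (?= ): it is checked but stays outside the match.
if ($seq =~ /ATG(?:...)*?(?=TAA|TAG|TGA)/) {
    $after_lookahead = $';
}

# Stop codon in an ordinary group: it becomes part of the match.
if ($seq =~ /ATG(?:...)*?(?:TAA|TAG|TGA)/) {
    $after_consuming = $';
}

printf "lookahead: rest=%s length=%d\n", $after_lookahead, length($after_lookahead);
printf "consuming: rest=%s length=%d\n", $after_consuming, length($after_consuming);
```

    With the look-ahead, the remainder is TAGCC (length 5); when the stop codon is consumed, only CC (length 2) is left.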

Re: explanation of code pls
by jweed (Chaplain) on Dec 06, 2003 at 21:06 UTC
    The regular expression finds all occurrences of ATG, followed by a set of zero or more codons (I assume we're working with DNA) matched in a non-greedy manner (i.e. as few as possible), followed by TAA, TAG, or TGA. For each match, it makes $start equal to the length of the portion of the string that follows the match ($'). Make sense?

    P.S. the /o modifier is not necessary: it only matters when the pattern interpolates variables, and this one doesn't.

    P.P.S perldoc perlre
    perldoc perlvar

    Update
    I'm late. I hate it when that happens.


    Who is Kayser Söze?
Re: explanation of code pls
by duff (Parson) on Dec 06, 2003 at 21:14 UTC

    Since you didn't say exactly what parts you are having trouble with, I'm going to assume you need help on everything. To wit: read perlvar for an explanation of the $' variable, perldoc -f length for explanation of the length() builtin function, perlre + perlretut for explanations of regular expressions.

    Assuming it's just the RE that's giving you trouble, it tries to match "ATG" followed by zero or more sequences of three characters (non greedily) followed by any one of "TAA", "TAG" or "TGA" but does not consume those 3 alternates. (?=...) is a zero-width positive lookahead assertion. It's a way to match without consuming input. So that RE will match strings like "ATGTAG" or "ATGXXXTAA" without consuming the TAA portion so that it can match again if the string were really "ATGXXXTAATGTAG" (but presumably this is DNA so there are no "X" characters :-)
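    A rough sketch of that last point, using the "ATGXXXTAATGTAG" string from above. The arrays and the second, consuming pattern are made up for the comparison and are not part of the original code.

```perl
use strict;
use warnings;

my $seq = 'ATGXXXTAATGTAG';

# With the look-ahead: record [offset, matched text] for every match.
my @lookahead;
while ($seq =~ /(ATG)(...)*?(?=TAA|TAG|TGA)/g) {
    push @lookahead, [ $-[0], $& ];
}

# Same pattern, but with the stop codon consumed by an ordinary group.
my @consuming;
while ($seq =~ /(ATG)(...)*?(?:TAA|TAG|TGA)/g) {
    push @consuming, [ $-[0], $& ];
}

print "look-ahead: ", join(', ', map { "$_->[1] at $_->[0]" } @lookahead), "\n";
print "consuming:  ", join(', ', map { "$_->[1] at $_->[0]" } @consuming), "\n";
```

    The look-ahead version finds ATGXXX at offset 0 and then ATG at offset 8, where the second ATG reuses the final "A" of the unconsumed TAA. The consuming version swallows TAA and finds only the first match.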

Re: explanation of code pls
by bradcathey (Prior) on Dec 06, 2003 at 21:02 UTC
    The best RegEx tutorial on the web is here

    Update
    Oops, my bad, this almost looks like it came from the tutorial (same codons, at least). Thanks other monks for not being so terse and explaining the RE.

    —Brad
    "A little yeast leavens the whole dough."