One of your main problems here is deciding that you needed a look-ahead assertion: you don't. (See Look-Around Assertions in perlre - Extended Patterns for details.)

It's useful to show actual and expected output. Here's what I get:

$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms) {
        say $1;
    }
'
ggataaaaatataggctataaatggcgccccggctaattttt

So, that skips everything until 'atg' is found; after that, [acgt]+ greedily captures as many characters as it can, and your rule (must not be followed by taa|tag|tga) is only tested at the point where the capture ends. The end of the $dna string is not "followed by taa|tag|tga", so the assertion passes and the successful match ends there.
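One way to check that the look-ahead contributes nothing here is to run the pattern with and without it and compare the captures. This is only a sketch: the sub name capture_after_atg is made up for illustration, and the look-ahead uses a non-capturing alternation for brevity.

```perl
use strict;
use warnings;
use feature 'say';

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# Hypothetical helper: return the first capture for a pattern, or undef.
sub capture_after_atg {
    my ($seq, $re) = @_;
    return $seq =~ $re ? $1 : undef;
}

my $with    = capture_after_atg($dna, qr/atg([acgt]+)(?!taa|tag|tga)/);
my $without = capture_after_atg($dna, qr/atg([acgt]+)/);

# The greedy capture runs to the end of the string in both cases.
say $with eq $without ? 'same capture' : 'different capture';
```

For this $dna both patterns capture everything after the first 'atg', so the comparison reports the same capture.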

What you really want to do is stop capturing when taa|tag|tga is found. That would be:

$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+)(?:taa|tag|tga)/xms) {
        say $1;
    }
'
ggataaaaatataggctataaatggcgccccggc

So now, as many as possible of [acgt] are captured until taa|tag|tga is found.

Furthermore, it looks like you want "as few as possible of [acgt]" instead of "as many as possible of [acgt]":

$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+?)(?:taa|tag|tga)/xms) {
        say $1;
    }
'
gga

You can clean that up by replacing [acgt] with . (you only have those four letters in $dna and, indeed, in DNA) and removing the three modifiers xms which you make no use of.

$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg(.+?)(?:taa|tag|tga)/) {
        say $1;
    }
'
gga
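To see the greedy and non-greedy quantifiers side by side, a short script along these lines may help. It is a sketch using the same $dna; the variable names $greedy and $lazy are just illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# In list context a successful match returns its captures.
my ($greedy) = $dna =~ /atg([acgt]+)(?:taa|tag|tga)/;
my ($lazy)   = $dna =~ /atg([acgt]+?)(?:taa|tag|tga)/;

say "greedy: $greedy";   # runs on to the last stop codon it can reach
say "lazy:   $lazy";     # stops at the first stop codon after atg
```

The only difference between the two patterns is the ? after the +.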

I note that the modifiers xms are written in the same (alphabetically) unordered way as they appear throughout Perl Best Practices (PBP). So, either you've just copied those from somewhere else and don't know what they mean (see perlre - Modifiers) or you're required to follow PBP. If the latter, you should use warnings (see also -w in perlrun) and the regular expression would be better as:

/atg (.+?) (?>taa|tag|tga)/msx

(?>pattern) is also explained in perlre - Extended Patterns.
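If you do keep the x modifier, you may as well use what it offers: whitespace in the pattern is ignored, so the regex can be spread over several lines and commented. A minimal sketch, assuming the same $dna as above (the name $start_to_stop_re is made up):

```perl
use strict;
use warnings;
use feature 'say';

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# Avoid sigils in these comments: the pattern is interpolated first.
my $start_to_stop_re = qr{
    atg                 # start codon
    (.+?)               # as few characters as possible, capture group 1
    (?> taa|tag|tga )   # stop codon, as an atomic group
}msx;

say $1 if $dna =~ $start_to_stop_re;
```

For this $dna it prints gga, the same as the one-line versions.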

Finally, in your real code, unless you're ensuring that $dna is always lowercase (e.g. by using lc), you should also add the i modifier (also described in perlre - Modifiers).
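Here's a rough sketch of the two options, using a made-up uppercase sequence rather than anything from your real data:

```perl
use strict;
use warnings;
use feature 'say';

my $dna = 'ATCTCGGATAATGGGATAAAAATATAGG';   # made-up uppercase input

# Option 1: match case-insensitively with the i modifier.
my ($with_i) = $dna =~ /atg(.+?)(?:taa|tag|tga)/i;

# Option 2: normalise with lc first, then match as before.
my ($with_lc) = lc($dna) =~ /atg(.+?)(?:taa|tag|tga)/;

say $with_i;    # the capture keeps the case of the input
say $with_lc;   # the capture is lowercase
```

Which one suits you depends on whether later code expects the captured text in its original case.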

-- Ken


In reply to Re: Simple regex question. Grouping with a negative lookahead assertion. by kcott
in thread Simple regex question. Grouping with a negative lookahead assertion. by Anonymous Monk
