in reply to Simple regex question. Grouping with a negative lookahead assertion.
One of your main problems here is deciding that you needed a look-ahead assertion: you don't. (See Look-Around Assertions in perlre - Extended Patterns for details.)
It's useful to show actual and expected output. Here's what I get:
$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms) {
        say $1;
    }
'
ggataaaaatataggctataaatggcgccccggctaattttt
So, that skips everything until 'atg' is found; after that, [acgt]+ is greedy and captures as many characters as possible, provided your rule (must not be followed by taa|tag|tga) holds at the point where it stops. Here it runs all the way to the end of $dna. The end of the string is not "followed by taa|tag|tga", so the assertion succeeds and the match ends there.
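A minimal sketch of that behaviour on a shorter input may make it easier to see. The sequence and the sub name capture_with_lookahead are made up for illustration and are not from the original question; the string deliberately ends in a stop codon.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: returns what the lookahead version captures, or undef.
sub capture_with_lookahead {
    my ($seq) = @_;
    return $seq =~ /atg([acgt]+)(?!taa|tag|tga)/ ? $1 : undef;
}

# Hypothetical input: start codon, three bases, then the stop codon taa.
# The greedy [acgt]+ swallows the stop codon too, and the negative
# lookahead then succeeds trivially at the end of the string.
say capture_with_lookahead('atgccctaa');    # prints ccctaa
```

The stop codon ends up inside the capture, so the negative lookahead does nothing to stop the match at a stop codon.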
What you really want to do is stop capturing when taa|tag|tga is found. That would be:
$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+)(?:taa|tag|tga)/xms) {
        say $1;
    }
'
ggataaaaatataggctataaatggcgccccggc
So now, as many as possible of [acgt] are captured until taa|tag|tga is found.
Furthermore, it looks like you want "as few as possible of [acgt]" instead of "as many as possible of [acgt]":
$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg([acgt]+?)(?:taa|tag|tga)/xms) {
        say $1;
    }
'
gga
You can clean that up by replacing [acgt] with . (you only have those four letters in $dna and, indeed, in DNA) and removing the three modifiers xms which you make no use of.
$ perl -Mstrict -Mwarnings -E '
    my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
    if ($dna =~ /atg(.+?)(?:taa|tag|tga)/) {
        say $1;
    }
'
gga
I note that the modifiers xms are written in the same (alphabetically) unordered way as they appear throughout Perl Best Practices (PBP). So, either you've just copied those from somewhere else and don't know what they mean (see perlre - Modifiers) or you're required to follow PBP. If the latter, you should use warnings (see also -w in perlrun) and the regular expression would be better as:
/atg (.+?) (?>taa|tag|tga)/msx
(?>pattern) is also explained in perlre - Extended Patterns.
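As a quick check, here is a sketch showing that the PBP-style pattern, with whitespace added under /x and the atomic group in place of (?:...), captures the same text as the earlier versions. The sub name capture_until_stop is hypothetical; the $dna value is the one used in the examples above.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper wrapping the PBP-style regex.
sub capture_until_stop {
    my ($seq) = @_;
    # Under /x the spaces in the pattern are ignored; (?>...) is an atomic group.
    return $seq =~ /atg (.+?) (?>taa|tag|tga)/msx ? $1 : undef;
}

my $dna = q{atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt};
say capture_until_stop($dna);    # prints gga
```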
Finally, in your real code, unless you're ensuring that $dna is always lowercase (e.g. by using lc), you should also add the i modifier (also described in perlre - Modifiers).
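The two options could look something like the following sketch. Both sub names and the uppercase input are hypothetical. The difference between them is that with /i the capture keeps the case of the input, whereas normalising with lc first gives lowercase output.

```perl
use strict;
use warnings;
use feature 'say';

# Option 1 (hypothetical helper): match case-insensitively with /i.
sub capture_ci {
    my ($seq) = @_;
    return $seq =~ /atg (.+?) (?>taa|tag|tga)/imsx ? $1 : undef;
}

# Option 2 (hypothetical helper): lowercase a copy first, then match as before.
sub capture_lc {
    my ($seq) = @_;
    my $lower = lc $seq;
    return $lower =~ /atg (.+?) (?>taa|tag|tga)/msx ? $1 : undef;
}

my $upper = 'ATCTCGGATAATGGGATAAAAATATAG';    # made-up uppercase input
say capture_ci($upper);    # prints GGA
say capture_lc($upper);    # prints gga
```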
-- Ken