in reply to Simple regex question. Grouping with a negative lookahead assertion.

I am trying to capture the very short string after the 'atg' and before (not including) either 'taa', 'tag', or 'tga'.

This doesn't adequately explain what you're trying to match, but the program/regex you posted does actually match, so what is the problem you're trying to solve?

See how it runs with use re 'debug';

```
#!/usr/bin/perl --
use warnings;
use strict;
use re 'debug';
my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
if ($dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms) {
    print $1;
}
__END__
$ perl fudge
Compiling REx "atg([acgt]+)(?!(taa|tag|tga))"
Final program:
   1: EXACT <atg> (3)
   3: OPEN1 (5)
   5:   PLUS (17)
   6:     ANYOF[acgt][] (0)
  17: CLOSE1 (19)
  19: UNLESSM[0] (36)
  21:   OPEN2 (23)
  23:     EXACT <t> (25)
  25:     TRIE-EXACT[ag] (32)
          <aa>
          <ag>
          <ga>
  32:   CLOSE2 (34)
  34:   SUCCEED (0)
  35: TAIL (36)
  36: END (0)
anchored "atg" at 0 (checking anchored) minlen 4
Guessing start of match in sv for REx "atg([acgt]+)(?!(taa|tag|tga))" against "atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt"
Found anchored substr "atg" at offset 10...
Starting position does not contradict /^/m...
Guessed: match at offset 10
Matching REx "atg([acgt]+)(?!(taa|tag|tga))" against "atgggataaaaatataggctataaatggcgccccggctaattttt"
  10 <ggata> <atgggataaa>      |  1:EXACT <atg>(3)
  13 <taatg> <ggataaaaat>      |  3:OPEN1(5)
  13 <taatg> <ggataaaaat>      |  5:PLUS(17)
                                  ANYOF[acgt][] can match 42 times out of 2147483647...
  55 <cggctaattttt> <>         | 17:  CLOSE1(19)
  55 <cggctaattttt> <>         | 19:  UNLESSM[0](36)
  55 <cggctaattttt> <>         | 21:    OPEN2(23)
  55 <cggctaattttt> <>         | 23:    EXACT <t>(25)
                                        failed...
  55 <cggctaattttt> <>         | 36:  END(0)
Match successful!
ggataaaaatataggctataaatggcgccccggctaatttttFreeing REx: "atg([acgt]+)(?!(taa|tag|tga))"
```
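
The trace shows `[acgt]+` consuming all 42 remaining characters, after which the negative lookahead succeeds trivially because nothing is left to look at. As a rough sketch (not part of the original program), the same thing can be confirmed without the debug noise by checking the capture offsets in the built-in `@-` and `@+` arrays. The helper name `greedy_capture` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: runs the regex from the original post and returns
# the captured text plus the start and end offsets of capture group 1.
sub greedy_capture {
    my ($dna) = @_;
    if ( $dna =~ /atg([acgt]+)(?!(taa|tag|tga))/xms ) {
        # $-[1] and $+[1] hold the start and end offsets of group 1
        return ( $1, $-[1], $+[1] );
    }
    return;
}

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
my ( $captured, $start, $end ) = greedy_capture($dna);

printf "captured %d characters, offsets %d..%d of %d\n",
    length($captured), $start, $end, length($dna);
```

If the end offset equals the length of the string, the quantifier ran to the end and the lookahead never had a stop codon to reject.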

Re^2: Simple regex question. Grouping with a negative lookahead assertion.
by Anonymous Monk on Jul 14, 2013 at 01:45 UTC
    I am trying to capture the very short string after the 'atg' and before (not including) either 'taa', 'tag', or 'tga'.

    The very short string (three nucleotides) is the one after the first 'atg' and before any of the three stop codons (in DNA form, i.e., before transcription has occurred). In other words, 'gga', since the 'taa' that immediately follows should (ideally) prevent further matching.

    For example:

    $dna = q/attatcgatgaaattagggctaatctcgcggggcctat/;
                    ^-^    ^-^
                    match  match and exit


    The characters (nucleotides) between the markers (and only these) should be captured and accessible in $1.
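
A minimal sketch of one regex that would behave this way, offered under assumptions rather than as a definitive answer: make the quantifier non-greedy and swap the negative lookahead for a positive one, so the capture stops just before the first stop codon. It ignores reading frame, which the question does not mention, and the helper name `capture_before_stop` is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first 'atg' and the
# first following 'taa', 'tag' or 'tga', or undef if no such pair exists.
sub capture_before_stop {
    my ($dna) = @_;
    # +? grows the capture one character at a time; the positive
    # lookahead (?=...) requires a stop codon right after it without
    # making the codon part of the match.
    return $dna =~ /atg([acgt]+?)(?=taa|tag|tga)/ ? $1 : undef;
}

print capture_before_stop('atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt'), "\n";
print capture_before_stop('attatcgatgaaattagggctaatctcgcggggcctat'), "\n";
```

Because `+?` needs at least one character, a stop codon sitting immediately after 'atg' is skipped; using `*?` instead would allow an empty capture in that case.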