in reply to Simple regex question. Grouping with a negative lookahead assertion.

Just one question: In the sequence  'atgaaaaa' (which is not terminated by any of (taa|tag|tga)), what should be matched? From the discussion in the thread so far, I assume the answer is 'nothing'.

With that assumption in hand, here's a small variation on BrowserUk's approach, which is easily adapted to capture all kinds of info about each match. This needs Perl version 5.10+ for  ${^MATCH} and  \K and the  //p regex modifier. If only the matching sub-sequences are needed, it can capture directly to an array. Because it does not use capture groups, it may be slightly faster, but I have not Benchmark-ed this.

    >perl -wMstrict -le
    "my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
     ;;
     my @sub_seqs;
     push @sub_seqs, [ ${^MATCH}, $-[0] ]
         while $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     ;;
     printf qq{%d sub-sequence(s) \n}, scalar @sub_seqs;
     ;;
     print $dna if @sub_seqs;
     for my $ar_sub_seq (@sub_seqs) {
       my $cursor = ('-' x $ar_sub_seq->[1]) . ('^' x length $ar_sub_seq->[0]);
       print $cursor;
       }
     ;;
     my @ss = $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     printf qq{'$_' } for @ss;
    "
    2 sub-sequence(s)
    atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt
    -------------^^^
    -------------------------------------^^^^^^^^^^
    'gga' 'gcgccccggc'
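For anyone who would rather run this from a file than from a Windows command line, here is a minimal sketch of the same idea as a standalone script. The sub name find_orf_bodies is made up for illustration and does not appear anywhere in the thread. The script also tries the unterminated 'atgaaaaa' case from the question above, which, under the 'nothing' assumption, should give no matches.

```perl
use strict;
use warnings;
use 5.010;   # \K, ${^MATCH} and the //p modifier need Perl 5.10+

# Hypothetical helper (the name is illustrative, not from the thread):
# returns a list of [ matched_text, offset ] pairs, one for every run of
# bases that follows 'atg' and is terminated by a stop codon (taa|tag|tga).
sub find_orf_bodies {
    my ($dna) = @_;
    my @found;
    while ($dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga ) }xmspg) {
        # \K drops 'atg' from the reported match, so $-[0] is the
        # offset of the first base after the start codon.
        push @found, [ ${^MATCH}, $-[0] ];
    }
    return @found;
}

my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
for my $hit (find_orf_bodies($dna)) {
    printf "%-12s at offset %d\n", "'$hit->[0]'", $hit->[1];
}

# An 'atg' with no stop codon after it is not matched at all.
my @none = find_orf_bodies('atgaaaaa');
printf "unterminated: %d match(es)\n", scalar @none;
```

Returning the text and the offset together keeps the cursor-drawing code from the one-liner easy to bolt on afterwards.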

Re^2: Simple regex question. Grouping with a negative lookahead assertion.
by BrowserUk (Patriarch) on Jul 14, 2013 at 06:42 UTC

    Sorry pal. Most of your posts -- especially those regarding regex -- get an upvote from me, but this one got --. It's a crock.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sorry pal. Most of your posts -- especially those regarding regex -- get an upvote from me, but this one got --. It's a crock.

      Apparently it was so bad, you tried to -- it three times!

      I was curious about the locus of crockitudinousness and decided to do some benchmarking, since performance is usually at the root of these squabbles. (Update: Benchmarked variations include some of those used by kcott here.) I must admit I was shocked, shocked by the results. There were no big surprises until I looked at the effect of the //p regex modifier. Simply adding this modifier to
          m{ atg ([acgt]+?) (?= taa|tag|tga) }xmsg
      in the  push @ra, $1 variation ($push_cg below, which otherwise performs roughly comparably to the other variations) slows its performance by orders of magnitude, so much so that I didn't have the patience to run the benchmark to completion.

      Am I doing this right? (Update: I.e., is the effect of the use of  //p as in the  $push_KM sub below, which I don't even have the patience to benchmark, really so egregious?) Is this all down to the  //p modifier? And if so, have the proper authorities been notified? If you've touched on this in other threads, I have not been following these discussions as carefully as I ought. Anyway, here's my benchmark code. As always, I would be interested in any comments you might have.
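A comparison of this kind could be set up along the following lines with the core Benchmark module. This is only a sketch under my own assumptions: the sub names with_cg and with_cg_p, the repeated test string and the run time are illustrative and are not the original $push_cg / $push_KM code. Timings will depend on the Perl version and on the length of the string, so none are quoted here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: the sample sequence from earlier in the thread, repeated.
my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt' x 100;

# Hypothetical sub: capture group, no //p.
sub with_cg {
    my @ra;
    push @ra, $1 while $dna =~ m{ atg ([acgt]+?) (?= taa|tag|tga) }xmsg;
    return \@ra;
}

# Hypothetical sub: the identical regex with //p added.
# ${^MATCH} is never read; only the modifier differs.
sub with_cg_p {
    my @ra;
    push @ra, $1 while $dna =~ m{ atg ([acgt]+?) (?= taa|tag|tga) }xmsgp;
    return \@ra;
}

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, { cg => \&with_cg, cg_p => \&with_cg_p });
```

Keeping the two subs identical apart from the modifier means any gap in the table can be pinned on //p alone rather than on differences in how results are collected.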

        "Benchmarked variations include some of those used by kcott"

        I'm assuming you're referring to cg_ncg with (?: ... ) and cg_atomic with (?> ... ).

        Prior to posting yesterday, and purely out of curiosity, I ran /atg(.+?)(?:taa|tag|tga)/ and /atg(.+?)(?>taa|tag|tga)/ through Regexp::Debugger, looking at the matching process step by step. From memory, ?: took 64 steps (in total) to complete the match while ?> took 66 steps. That probably accounts for the 3% difference between cg_atomic and cg_ncg (66/64 = 1.03125).

        Again from memory, the two extra steps occurred after failing to match taa|tag|tga after either the 'a' or 't' of 'atg'. For the ?: case, the steps were something like: "(?:" start non-capture group; "taa" no match; "|" next alt; ...; "tga" no match. For the ?> case: "(?>" start non-backtracking group; ... as for ?: ...; (then the additional step) ")" end non-backtracking group.

        Obviously, you can check that yourself if you're so inclined. I wasn't inclined to repeat the process. :-)
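As a starting point for such a check, the two patterns can be run side by side. The sketch below assumes the sample string from earlier in the thread, which is not necessarily the input used for the step counts above. It only shows that both forms return the same captures, so any difference between them lies in the work the engine does, not in the result.

```perl
use strict;
use warnings;

# Assumed input: the sample sequence from earlier in the thread.
my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';

# Non-capturing group around the stop codons.
my @ncg    = $dna =~ /atg(.+?)(?:taa|tag|tga)/g;

# Atomic (non-backtracking) group around the stop codons.
my @atomic = $dna =~ /atg(.+?)(?>taa|tag|tga)/g;

print "ncg:    @ncg\n";
print "atomic: @atomic\n";
print "same captures\n" if "@ncg" eq "@atomic";
```

To watch the engine's individual steps without installing Regexp::Debugger, the core `use re 'debug';` pragma prints a trace of the match to STDERR, though its output is laid out differently and its step counts need not line up with the figures quoted above.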

        [I haven't analysed your benchmarking further.]

        -- Ken