in reply to Simple regex question. Grouping with a negative lookahead assertion.
Just one question: In the sequence 'atgaaaaa' (which is not terminated by any of (taa|tag|tga)), what should be matched? From the discussion in the thread so far, I assume the answer is 'nothing'.
With that assumption in hand, here's a small variation on BrowserUk's approach that is easily adapted to capture all kinds of info about each match. It needs Perl version 5.10+ for \K, ${^MATCH} and the //p regex modifier. If only the matching sub-sequences are needed, a list-context match with //g returns them directly into an array. Because it does not use capture groups, it may be slightly faster, but I have not Benchmark-ed this.
    >perl -wMstrict -le
    "my $dna = 'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt';
     ;;
     my @sub_seqs;
     push @sub_seqs, [ ${^MATCH}, $-[0] ]
         while $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     ;;
     printf qq{%d sub-sequence(s) \n}, scalar @sub_seqs;
     ;;
     print $dna if @sub_seqs;
     for my $ar_sub_seq (@sub_seqs) {
       my $cursor = ('-' x $ar_sub_seq->[1]) . ('^' x length $ar_sub_seq->[0]);
       print $cursor;
       }
     ;;
     my @ss = $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga) }xmspg;
     printf qq{'$_' } for @ss;
    "
    2 sub-sequence(s)
    atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt
    -------------^^^
    -------------------------------------^^^^^^^^^^
    'gga' 'gcgccccggc'
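For anyone who would rather run this from a file than fight with command-line quoting, here is a rough script sketch of the same idea. The sub name find_sub_seqs is my own invention for illustration, not something from the thread, and the sketch assumes lower-case input with only a, c, g and t. It also checks the assumption made above, namely that the unterminated 'atgaaaaa' yields no match at all.

```perl
#!/usr/bin/perl
use 5.010;    # \K, ${^MATCH} and the /p modifier need Perl 5.10 or later
use strict;
use warnings;

# find_sub_seqs() is a hypothetical helper name, not from the thread.
# Returns a list of [ sub-sequence, start offset ] pairs, one per match.
sub find_sub_seqs {
    my ($dna) = @_;
    my @found;
    # \K discards the leading 'atg' from the match, so ${^MATCH} and
    # $-[0] describe only the part between 'atg' and the stop codon.
    push @found, [ ${^MATCH}, $-[0] ]
        while $dna =~ m{ atg \K [acgt]+? (?= taa | tag | tga ) }xmspg;
    return @found;
}

my @samples = (
    'atgaaaaa',    # no stop codon after the start codon
    'atctcggataatgggataaaaatataggctataaatggcgccccggctaattttt',
);

for my $dna (@samples) {
    my @found = find_sub_seqs($dna);
    printf "%s: %d sub-sequence(s)\n", $dna, scalar @found;
    printf "  '%s' at offset %d\n", @$_ for @found;
}
```

The first sample reports zero sub-sequences; the second reports 'gga' at offset 13 and 'gcgccccggc' at offset 37, which agrees with the cursor lines in the one-liner's output.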