in reply to Finding patterns
... primers are ATGAA and TGCCG, and the sequence is AATCGGGTATGAAAAATTTTGCCGGCGTTTGCG I want to get AAAATTT ...
... some sequence have multiple ATGG or TGCCG ...
MoniqueLT: I am a bit confused between ATGAA, ATGA and ATGG, but I will assume that in your example you ultimately wanted to extract AAAATTT. That means using ATGA as the first primer (with ATGAA the captured text would be AAATTT). The example below shows one of many possible approaches. This approach does not provide positional information on the matches, which I suspect you will ultimately need. Please post a follow-up in this thread if match position information is needed.
In addition to perlre, see perlretut and perlrequick and the Regex section of this site's Tutorials.
>perl -wMstrict -le
"my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';
 ;;
 my $prime1 = 'ATGA';
 my $prime2 = 'TGCCG';
 ;;
 my @subseqs = $seq =~ m{ $prime1 (.*?) $prime2 }xmsg;
 printf qq{'$_' } for @subseqs;
"
'AAAATTT' 'ATATAT'
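Since the original question mentions sequences with more than one copy of a primer, here is a small sketch of how the non-greedy `(.*?)` behaves in that case. The test sequence is made up for illustration, and the second pattern (a negative lookahead that keeps the first primer out of the capture) is only one possible variant, not something the question asked for. `\Q...\E` is added as a precaution so the primers are always treated as literal text.

```perl
use strict;
use warnings;

# Hypothetical test sequence: the forward primer ATGA occurs twice
# before the first reverse primer TGCCG.
my $seq    = 'ATGACCATGAGGTGCCGTTTGCCG';
my $prime1 = 'ATGA';
my $prime2 = 'TGCCG';

# Plain non-greedy match: the match starts at the leftmost ATGA,
# so the capture runs across the second ATGA.
my @plain = $seq =~ m{ \Q$prime1\E (.*?) \Q$prime2\E }xmsg;

# Variant: forbid the forward primer inside the capture, so only the
# ATGA nearest to the following TGCCG can start a match.
my @nearest = $seq =~ m{
    \Q$prime1\E
    ( (?: (?! \Q$prime1\E ) . )*? )
    \Q$prime2\E
}xmsg;

print "plain:   '$_'\n" for @plain;    # plain:   'CCATGAGG'
print "nearest: '$_'\n" for @nearest;  # nearest: 'GG'
```

Which of the two behaviours is wanted depends on what the repeated primers mean biologically, so treat this as a starting point.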
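Update: What the heck, positional info is simple enough, so here's an approach. See @- and @+ in perlvar. The offsets reported are zero-based, and $+[1] points just past the end of the capture, hence the -1.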
>perl -wMstrict -le
"my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';
 ;;
 my $prime1 = 'ATGA';
 my $prime2 = 'TGCCG';
 ;;
 while ($seq =~ m{ $prime1 (.*?) $prime2 }xmsg) {
   printf qq{matched '%s' at %d thru %d \n}, $1, $-[1], $+[1]-1;
   }
"
matched 'AAAATTT' at 12 thru 18
matched 'ATATAT' at 37 thru 42
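For use in a script file rather than on the command line, the same idea might be wrapped in a subroutine along these lines. This is only a sketch: `find_between` is a hypothetical name, not a function from any module, and the `[subsequence, start, end]` return shape is an assumption chosen for the example.

```perl
use strict;
use warnings;

# find_between() is a hypothetical helper, not part of any module.
# Returns one [subsequence, start, end] triple per match; offsets are
# zero-based and the end offset is inclusive, as in the one-liner.
sub find_between {
    my ($seq, $prime1, $prime2) = @_;
    my @hits;
    while ($seq =~ m{ \Q$prime1\E (.*?) \Q$prime2\E }xmsg) {
        # $-[1] is the start offset of capture group 1; $+[1] is the
        # offset just past its end, hence the -1.
        push @hits, [ $1, $-[1], $+[1] - 1 ];
    }
    return @hits;
}

my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';

for my $hit (find_between($seq, 'ATGA', 'TGCCG')) {
    printf "matched '%s' at %d thru %d\n", @$hit;
}
```

Note that if the two primers are directly adjacent, the capture is empty and the reported end offset is one less than the start offset.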