MoniqueLT has asked for the wisdom of the Perl Monks concerning the following question:

PLEASE HELP!! Every way I attempt to do this, I am not successful!

I am trying to extract the sequence in between two primers. For example, if my primers are ATGAA and TGCCG, and the sequence is AATCGGGTATGAAAAATTTTGCCGGCGTTTGCG, I want to get AAAATTT.

I tried this by using split() on the first primer, saving the value, then split() on the second primer, but some sequences have multiple ATGG or TGCCG and it splits too many times.

So I tried it using the m// function, something like

$seq=~ m/.*ATGAA.*?TGCCG.*/; $match=$_;

but this isn't working either!! I know there is a simple way, but I can't seem to find a helpful function!! Any help would be greatly appreciated!

Replies are listed 'Best First'.
Re: Finding patterns
by roboticus (Chancellor) on May 24, 2012 at 00:58 UTC

    MoniqueLT:

    You're close. First, you don't need the .* at the beginning and end of the match. Next, you need to tell the regex to capture the sequence you're interested in. Finally, the captured text doesn't fall into $_, it falls into $1. Read perldoc perlre and look at the capture buffers section.

    Here's a quickie example:

    $ cat abc.pl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = "AATCGGGTATGAAAAATTTTGCCGGCGTTTGCG";
    if ($string =~ /ATGAA(.*?)TGCCG/) {
        my $sequence = $1;
        print "Found it: '$1'\n";
    }
    else {
        print "I don't see it!\n";
    }

    $ perl abc.pl
    Found it: 'AAATTT'

    Update: s/capture groups/capture buffers/

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Finding patterns
by AnomalousMonk (Archbishop) on May 24, 2012 at 03:58 UTC
    ... primers are ATGAA and TGCCG, and the sequence is AATCGGGTATGAAAAATTTTGCCGGCGTTTGCG I want to get AAAATTT ...

    ... some sequence have multiple ATGG or TGCCG ...

    MoniqueLT: I am a bit confused between ATGAA, ATGA and ATGG, but I will assume that in your example you wanted to extract AAAATTT ultimately. The example below shows one approach of many possible. Of course, this approach does not provide positional information on the matches, which I suspect you will ultimately need. Please post a follow-up in this thread if match position information is needed.

    In addition to perlre, see perlretut and perlrequick and the Regex section of this site's Tutorials.

    >perl -wMstrict -le
    "my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';
     ;;
     my $prime1 = 'ATGA';
     my $prime2 = 'TGCCG';
     ;;
     my @subseqs = $seq =~ m{ $prime1 (.*?) $prime2 }xmsg;
     printf qq{'$_' } for @subseqs;
    "
    'AAAATTT' 'ATATAT'

    Update: What the heck – positional info is simple enough, here's an approach. See  @- and  @+ in perlvar.

    >perl -wMstrict -le
    "my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';
     ;;
     my $prime1 = 'ATGA';
     my $prime2 = 'TGCCG';
     ;;
     while ($seq =~ m{ $prime1 (.*?) $prime2 }xmsg) {
       printf qq{matched '%s' at %d thru %d \n}, $1, $-[1], $+[1]-1;
       }
    "
    matched 'AAAATTT' at 12 thru 18
    matched 'ATATAT' at 37 thru 42
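
The one-liner above can also be wrapped in a small subroutine for use in a script. This is only a sketch: the name between_primers and the hash keys text, start and end are made up for illustration and do not come from the thread. It additionally wraps the primers in \Q...\E so they are matched literally, which makes no difference for plain A/C/G/T primers.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns one hash ref per
# non-overlapping match, holding the captured text and its 0-based offsets.
sub between_primers {
    my ($seq, $left, $right) = @_;
    my @hits;
    # \Q...\E quotes the primers; /s lets . match newlines, /g finds every match.
    while ($seq =~ /\Q$left\E(.*?)\Q$right\E/sg) {
        # $-[1] and $+[1] are the start and one-past-the-end offsets of group 1.
        push @hits, { text => $1, start => $-[1], end => $+[1] - 1 };
    }
    return @hits;
}

my $seq = 'AATCGGGTATGAAAAATTTTGCCGGCGTTTGCGATGAATATATTGCCGGAGAGA';
for my $hit (between_primers($seq, 'ATGA', 'TGCCG')) {
    printf "matched '%s' at %d thru %d\n", @{$hit}{qw(text start end)};
}
```

Run on the same sequence and primers, this should report the same two matches and offsets as the one-liner.
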
Re: Finding patterns
by snape (Pilgrim) on May 24, 2012 at 00:59 UTC

    This might work for you:

    if ($seq =~ m/ATGAA(.*)TGCCG/){ print $1; }

    Also, the following link1 and link2 talk about the concept of $1 and the use of regular expressions.
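
One thing to watch with the pattern above: (.*) is greedy, so when the second primer occurs more than once it captures up to the last occurrence, whereas (.*?) stops at the first one. A minimal sketch of the difference follows; the sequence and the subroutine names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this comparison.
sub capture_greedy { my ($s) = @_; my ($m) = $s =~ /ATGAA(.*)TGCCG/;  return $m }
sub capture_lazy   { my ($s) = @_; my ($m) = $s =~ /ATGAA(.*?)TGCCG/; return $m }

# Made-up sequence: one ATGAA followed by two TGCCG sites.
my $seq = 'GGATGAACCCTGCCGTTTTGCCGAA';

print "greedy: ", capture_greedy($seq), "\n";   # runs to the LAST TGCCG
print "lazy:   ", capture_lazy($seq),   "\n";   # stops at the FIRST TGCCG
```

With this input the greedy form captures CCCTGCCGTTT and the non-greedy form captures CCC. On the example sequence from the question both forms give the same result, because TGCCG occurs only once there.
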

Re: Finding patterns
by jack123 (Acolyte) on May 24, 2012 at 05:57 UTC
    What I understand from your question is that you want to find the string between two primers. The two primers you showed actually have no string in between them, so I changed the sequence a little, to AATCGGGTATGAAAAATTTxyzTGCCGGCGTTTGCG. Now the text you'll get in between these two primers is xyz. Try the following code and it'll work fine.

    $a = 'AATCGGGTATGAAAAATTTxyzTGCCGGCGTTTGCG';
    while ($a =~ /AAAATTT(.*?)TGCCG/g) {
        print "The variable is $1";
    }

    Let me know if your question is different from this.

      WOW those responses were quick!! Thanks so much! Switching out the $_; for $1, and adding () around .*? did the trick!!

      Sorry my question was a bit unclear! Thanks again guys!

Re: Finding patterns
by Anonymous Monk on May 24, 2012 at 00:41 UTC

    I know there is a simple way

    What is it? What makes the ATGG or TGCCG in the middle different from the one at the end?
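
On the split() attempt described in the question: if the intended reading is "the first occurrence of the left primer, then the first occurrence of the right primer after it", split's third argument (LIMIT) keeps it from splitting at every occurrence. The following is a sketch under that assumption; the subroutine name between_by_split is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: text between the first $left and the next $right,
# or undef if either primer is missing.
sub between_by_split {
    my ($seq, $left, $right) = @_;
    # LIMIT 2 splits at the first occurrence only, leaving the rest intact.
    my (undef, $after) = split /\Q$left\E/, $seq, 2;
    return undef unless defined $after;
    my ($between, $rest) = split /\Q$right\E/, $after, 2;
    # $rest is undefined when the right primer never appeared.
    return defined $rest ? $between : undef;
}

my $found = between_by_split('AATCGGGTATGAAAAATTTTGCCGGCGTTTGCG', 'ATGAA', 'TGCCG');
print defined $found ? "Found it: '$found'\n" : "I don't see it!\n";
```

For the example sequence and the primers ATGAA and TGCCG this yields AAATTT, the same result as the regex capture shown earlier in the thread.
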