ovo has asked for the wisdom of the Perl Monks concerning the following question:

I have data that looks like this:

>gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

>gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

(many more lines repeated like this)

I want to build a regular expression to capture everything between the ">" including the first ">" but excluding the last ">"

I thought this would work. Assume $seqs holds all of the above data.

$seqs =~ /^(\>.*(?=\>))/ms;

It doesn't, rather it captures the entire file. The trick appears to be in how to express the .* so that it grabs only until the next ">". How does one do this?

Re: Simple regular expression that isn't simple
by davorg (Chancellor) on Feb 28, 2006 at 15:55 UTC

    Something a bit simpler perhaps:

    my @records = $seqs =~ /(>[^>]+)/sg;
    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Simple regular expression that isn't simple
by duff (Parson) on Feb 28, 2006 at 15:37 UTC

    What you're asking for is the modifier to * to make it non-greedy. That would be ? used thusly:

    $seqs =~ /^(\>.*?(?=\>))/ms;

    Though you might want to consider choosing an alternative way to delimit your "records". In your text they appear to be separated by a blank line. If you set $/ = "\n\n"; and read one record at a time from your file, maybe that will work well enough for you? Or, you could at least use split to partition your data usefully.
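
    A minimal sketch of the difference, assuming a small made-up $seqs with three short records (the record contents are hypothetical, not the OP's data). It contrasts the greedy match, the non-greedy match, and split on a lookahead as one possible way to partition the data:

```perl
use strict;
use warnings;

# Hypothetical miniature data set: three FASTA-like records.
my $seqs = ">id1 first\nAAAA\nCCCC\n>id2 second\nGGGG\n>id3 third\nTTTT\n";

# Greedy: .* runs to the end of the string and backtracks only as far
# as the LAST ">", so the capture spans records 1 and 2.
my ($greedy) = $seqs =~ /^(>.*(?=>))/ms;

# Non-greedy: .*? stops at the first position where the lookahead
# succeeds, so the capture is the first record only.
my ($lazy) = $seqs =~ /^(>.*?(?=>))/ms;

# split on a zero-width lookahead cuts the string in front of every ">",
# which also returns the final record (it has no ">" after it).
my @records = split /(?=>)/, $seqs;

print "greedy captured ", length($greedy), " characters\n";
print "lazy captured: $lazy";
print "split found ", scalar(@records), " records\n";
```

    Note that neither single match returns the last record, because nothing follows it for the lookahead to see; split does not have that problem.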

Re: Simple regular expression that isn't simple
by prasadbabu (Prior) on Feb 28, 2006 at 15:37 UTC

    Hi ovo, if I understood your question correctly, here is one way to do it.

    use strict;
    local $/;
    my $seqs = <DATA>;
    my @match = $seqs =~ /(\>(?:(?:(?!>).)*))/gs;
    print join "\n", @match;

    __DATA__
    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

    (many more lines repeated like this)

    output:

    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

    (many more lines repeated like this)

    Prasad

Re: Simple regular expression that isn't simple
by thundergnat (Deacon) on Feb 28, 2006 at 18:30 UTC

    If your data truly looks that way, why don't you avoid all that aggravation and just slurp in the file in paragraph mode?

    use warnings;
    use strict;

    my @sequences;
    {
        local $/ = '';
        @sequences = <DATA>;
    }
    print @sequences;

    __DATA__
    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILR>SEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

Re: Simple regular expression that isn't simple
by Fletch (Bishop) on Feb 28, 2006 at 15:34 UTC

    Not a direct answer to your question, but given that this looks vaguely like some sort of genome-y stuff you might check out BioPerl first and see if there's not an off the shelf module that handles whatever this file format is (if it's something standard).

Re: Simple regular expression that isn't simple
by l3v3l (Monk) on Feb 28, 2006 at 17:05 UTC
    Your input seems to match your output ... is that what you were intending? or are you trying to do something like the following:
    # Choose any Range of FASTA records
    # Called as perl script_name.pl --start start_line --end end_line file_name
    # both --start and --end are optional;
    # no --start => start reading from start of file, end at --end
    # no --end => start reading from --start, read to end of file
    # neither => print every line
    use warnings;
    use strict;
    use Getopt::Long;

    my ($start, $end) = (undef, undef);
    GetOptions(
        "start=i" => \$start,
        "end=i"   => \$end,
    );

    # Read ">"-terminated chunks, so $. counts records rather than lines
    $/ = '>';
    while (<>) {
        if (   (!defined($start) || $start <= $.)
            && (!defined($end)   || $. <= $end) ) {
            print;
        }
    }
    or, here are some liners to data munge fasta records that may be closer to what you are looking for:
    # Choose first N FASTA records
    perl -ne 'BEGIN {$/=">";$o=0}{chomp;$o<N?(/^$/?next:print">$_"):last;$o++}' EXAMPLE.fa

    # Choose the Nth FASTA record
    perl -0x3E -e 'chomp(@lines = <>);print $lines[N]' EXAMPLE.fa

    # Choose any Range of FASTA records
    perl -0x3E -e 'chomp(@l=<>);for(STrange..ENDrange){print "record $_ is >$l[$_]\n"}' EXAMPLE.fa
    perl -0x3E -e 'chomp(@l=<>);print @l[STrange..ENDrange]' EXAMPLE.fa

    # show the number of FASTA records in a given FASTA file
    perl -0x3E -e '@l=grep/\s/,<>;print "$ARGV contains ".scalar(@l)." records\n"' EXAMPLE.fa
Re: Simple regular expression that isn't simple
by mickeyn (Priest) on Feb 28, 2006 at 15:40 UTC
    you should try something simpler (untested):
    $seqs =~ /^(>.*)>/ms
    Enjoy,
    Mickey
      $seqs =~ /^(>.*)>/ms
      This will grab everything from the start of the string (assuming the string starts with >), up to but not including the last >.

      While technically correct, reading between the lines the OP probably wants to grab the data in chunks that start with >. (Otherwise, substr would have been a better choice, no?)

      You have to make the * non-greedy. Better to just use a negated character class...

      I see that davorg's suggestion is what I was heading for..

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of
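
      A short sketch comparing the two approaches under /g, assuming a hypothetical two-record string (the names and data are made up for illustration). It suggests one reason the negated character class may be the safer choice: a non-greedy match that ends in a lookahead for ">" cannot see the final record.

```perl
use strict;
use warnings;

# Hypothetical sample: two records, the second has nothing after it.
my $seqs = ">id1\nAAAA\n>id2\nCCCC\n";

# Non-greedy with a lookahead for the next ">": the last record is never
# followed by ">", so it is not matched.
my @lazy = $seqs =~ /(>.*?(?=>))/gs;

# Allowing end-of-string as an alternative terminator picks it up.
my @lazy_fixed = $seqs =~ /(>.*?(?=>|\z))/gs;

# Negated character class: ">" followed by anything that is not ">".
# [^>] matches newlines too, so /s is not needed here.
my @negated = $seqs =~ /(>[^>]+)/g;

printf "lazy: %d, lazy with \\z: %d, negated class: %d\n",
    scalar(@lazy), scalar(@lazy_fixed), scalar(@negated);
```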

Re: Simple regular expression that isn't simple
by Anonymous Monk on Feb 28, 2006 at 16:06 UTC
    There are lots of good suggestions for you already, but I will explain a bit as well. As far as I know, ">" isn't a special character in a regex, so you don't have to escape it. If you want to get all the hits, you need the /g modifier at the end of your expression, along with an array on the left-hand side to store them. I also guess you have forgotten that you need to undefine $/ (the input record separator, "\n" by default) if you want to store all the lines of a file in a single scalar. This has also been shown in an example in this thread.
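
    Those points can be put together in one small sketch. It assumes a made-up two-record file, read through an in-memory filehandle so the example is self-contained; the variable names and the data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a FASTA file, read through an in-memory
# filehandle so that the sketch is self-contained.
my $text = ">id1 one\nAAAA\nCCCC\n\n>id2 two\nGGGG\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

# Slurp mode: with $/ undefined, a single read returns the whole file,
# newlines included.
my $seqs = do { local $/; <$fh> };
close $fh;

# /g in list context returns every capture; ">" needs no backslash.
my @records = $seqs =~ /(>[^>]+)/g;

# The newlines inside a record only need removing if the sequence
# itself is wanted as one string.
for my $record (@records) {
    my ($header, @lines) = split /\n/, $record;
    my $sequence = join '', @lines;
    print "$header => $sequence\n";
}
```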