Simple regular expression that isn't simple

ovo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Simple regular expression that isn't simple by davorg (Chancellor) on Feb 28, 2006 at 15:55 UTC
Something a bit simpler perhaps: `my @records = $seq =~ /(>[^>]+)/sg;` -- <http://dave.org.uk> "The first rule of Perl club is you do not talk about Perl club." -- Chip Salzenberg
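A minimal sketch of the negated-character-class approach, run on a short made-up string rather than the original poster's file (the sample data and the name `$seq` are assumptions for illustration). Because `[^>]+` cannot cross a `>`, each match stops just before the next header; the `/s` modifier is not needed here since the pattern contains no `.`.

```perl
use strict;
use warnings;

# Hypothetical sample: two short FASTA-style records in one string.
my $seq = ">id1 first\nAAAA\nCCCC\n>id2 second\nGGGG\n";

# List context plus /g returns every captured record at once.
my @records = $seq =~ /(>[^>]+)/g;

print scalar(@records), " records\n";
print "---\n$_" for @records;
```

Note that this treats every `>` as a record boundary, so it would split a record whose description or sequence happens to contain a `>`.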
Re: Simple regular expression that isn't simple by duff (Parson) on Feb 28, 2006 at 15:37 UTC
What you're asking for is the modifier to `*` to make it non-greedy. That would be `?`, used thusly: `$seqs =~ /^(\>.*?(?=\>))/ms;` Though you might want to consider choosing an alternative way to delimit your "records". In your text they appear to be separated by a blank line. If you set `$/ = "\n\n";` and read one record at a time from your file, maybe that will work well enough for you? Or, you could at least use `split` to partition your data usefully. duff
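The `split` idea could look roughly like this. It is only a sketch on invented sample data, and it assumes every record header starts at the beginning of a line. The pattern is zero-width, so nothing is consumed and each piece keeps its leading `>`.

```perl
use strict;
use warnings;

# Hypothetical sample: two records separated by a blank line.
my $seqs = ">id1 first\nAAAA\nCCCC\n\n>id2 second\nGGGG\n";

# Split at every line start that is followed by '>'.
# A zero-width match at the very start of the string yields no empty field.
my @records = split /^(?=>)/m, $seqs;

# Drop the trailing newlines and the blank separator line.
s/\s+\z// for @records;

print scalar(@records), " records\n";
print "[$_]\n" for @records;
```

Anchoring on `^` means a stray `>` in the middle of a sequence line does not start a new record, which a plain `[^>]` character class cannot guarantee.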
Re: Simple regular expression that isn't simple by prasadbabu (Prior) on Feb 28, 2006 at 15:37 UTC
Hi ovo, if I understood your question correctly, here is one way to do it.

    use strict;
    local $/;
    my $seqs = <DATA>;
    my @match = $seqs =~ /(\>(?:(?:(?!>).)*))/gs;
    print join "\n", @match;
    __DATA__
    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    (many more lines repeated like this)

output:

    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    (many more lines repeated like this)

Prasad
Re: Simple regular expression that isn't simple by thundergnat (Deacon) on Feb 28, 2006 at 18:30 UTC
If your data truly looks that way, why don't you avoid all that aggravation and just slurp in the file in paragraph mode?

    use warnings;
    use strict;

    my @sequences;
    {
        local $/ = '';    # paragraph mode: records end at blank lines
        @sequences = <DATA>;
    }
    print @sequences;
    __DATA__
    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILR>SEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN

    >gi|77641047|gb|ABB00395.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
    IFVQGDIGSKIIVTTRKNSVALMMGNEQISMNNLSTEASWSLFKRHAFENMNPMGYPELEEVGKQIAAKC
    KGLPLALKTLAGMLCSKSEIDEWKRILRSEIW

    >gi|77641045|gb|ABB00394.1| I2 [Lycopersicon esculentum]
    TRTPSTSLVVDSGIFGRQNEIEDLVGRLLSMDTKGKNLAVVPIVGMGGLGKTTLAKAVYNDERVKKHFGL
    TAWFCVSEAYDAFRITKGILQEIGSTDLKADHNLNQLQVKVKESLKGKKFLIVLDDVWNDNYNEWDDLRN
Re: Simple regular expression that isn't simple by Fletch (Bishop) on Feb 28, 2006 at 15:34 UTC
Not a direct answer to your question, but given that this looks vaguely like some sort of genome-y stuff you might check out BioPerl first and see if there's not an off the shelf module that handles whatever this file format is (if it's something standard).
Re: Simple regular expression that isn't simple by l3v3l (Monk) on Feb 28, 2006 at 17:05 UTC
Your input seems to match your output ... is that what you were intending? Or are you trying to do something like the following:

    # Choose any range of FASTA records
    # Called as: perl script_name.pl --start start_line --end end_line file_name
    # both --start and --end are optional;
    # no --start => start reading from start of file, end at --end
    # no --end => start reading from --start, read to end of file
    # neither => print every line
    use warnings;
    use strict;
    use Getopt::Long;

    my ($start, $end) = (undef, undef);
    GetOptions(
        "start=i" => \$start,
        "end=i"   => \$end,
    );

    # With '>' as the input record separator, $. counts
    # '>'-delimited chunks rather than physical lines.
    $/ = '>';
    while (<>) {
        if (   (!defined($start) || $start <= $.)
            && (!defined($end)   || $. <= $end) ) {
            print;
        }
    }

or, here are some one-liners to munge FASTA records that may be closer to what you are looking for:

    # Choose first N FASTA records
    perl -ne 'BEGIN {$/=">";$o=0}{chomp;$o<N?(/^$/?next:print">$_"):last;$o++}' EXAMPLE.fa

    # Choose the Nth FASTA record
    perl -0x3E -e 'chomp(@lines = <>);print $lines[N]' EXAMPLE.fa

    # Choose any range of FASTA records
    perl -0x3E -e 'chomp(@l=<>);for(STrange..ENDrange){print "record $_ is >$l[$_]\n"}' EXAMPLE.fa
    perl -0x3E -e 'chomp(@l=<>);print @l[STrange..ENDrange]' EXAMPLE.fa

    # Show the number of FASTA records in a given FASTA file
    perl -0x3E -e '@l=grep/\s/,<>;print "$ARGV contains ".scalar(@l)." records\n"' EXAMPLE.fa
Re: Simple regular expression that isn't simple by mickeyn (Priest) on Feb 28, 2006 at 15:40 UTC
you should try something simpler (untested): `$seqs =~ /^(>.*)>/ms` Enjoy, Mickey
Re^2: Simple regular expression that isn't simple by QM (Parson) on Feb 28, 2006 at 18:10 UTC
`$seqs =~ /^(>.*)>/ms` This will grab everything from the start of the string (assuming the string starts with `>`) up to but not including the last `>`. While technically correct, reading between the lines, the OP probably wants to grab the data in chunks that start with `>`. (Otherwise, `substr` would have been a better choice, no?) You have to make the `*` non-greedy. Better to just use a negated character class... I see that davorg's suggestion is what I was heading for. -QM -- Quantum Mechanics: The dreams stuff is made of
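To make the difference concrete, here is a small sketch comparing the greedy pattern, a non-greedy variant, and a negated character class on an invented three-record string (the sample data and variable names are assumptions, not the OP's input).

```perl
use strict;
use warnings;

# Hypothetical sample with three tiny records.
my $seqs = ">a\nAA\n>b\nCC\n>c\nGG\n";

my ($greedy) = $seqs =~ /^(>.*)>/ms;     # runs up to the LAST '>'
my ($lazy)   = $seqs =~ /^(>.*?)>/ms;    # stops before the NEXT '>'
my ($class)  = $seqs =~ /^(>[^>]*)/m;    # same chunk, no backtracking over '.'

# Show newlines as a literal \n so each result fits on one line.
for my $pair ([greedy => $greedy], [lazy => $lazy], [class => $class]) {
    (my $shown = $pair->[1]) =~ s/\n/\\n/g;
    print "$pair->[0]: $shown\n";
}
```

The greedy capture holds the first two records together, while the other two capture only the first record. The lazy form still requires a following `>`, so on its own it would miss the final record in the string; the character class has no such requirement.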
Re: Simple regular expression that isn't simple by Anonymous Monk on Feb 28, 2006 at 16:06 UTC
There are lots of good suggestions for you, but I will also explain a bit. As far as I know, ">" isn't a special character in a regex, is it? So you don't have to escape it. Then, if you want to get all hits, you will need the `/g` modifier at the end of your expression, along with an array on the left-hand side of the assignment to store all the hits. Then I guess you have forgotten that you need to remove the "\n" input record separator (undefine `$/`) if you want to read all the lines of a file into a single scalar. This has also been shown in an example in this thread.
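The two points about slurping and `/g` can be sketched as follows. An in-memory filehandle stands in for the real file so the example is self-contained; the sample text and variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample; a real script would open the FASTA file instead.
my $text = ">a\nAA\n>b\nCC\n";
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

# With $/ undefined, a single read returns the whole file.
my $seqs = do { local $/; <$fh> };
close $fh;

# List context: /g returns all captures in one go.
my @all = $seqs =~ /(>[^>]+)/g;

# Scalar context: /g finds one match per evaluation, resuming where it left off.
my $count = 0;
$count++ while $seqs =~ /(>[^>]+)/g;

print scalar(@all), " matches in list context, $count in the loop\n";
```

Localizing `$/` inside the `do` block restores the default separator afterwards, so later line-by-line reads elsewhere in the program are unaffected.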