use strict;
use warnings;
use autodie;
$_ = do { local $/; <DATA> };
while (/
Name: \s+ (.*?) $ # $ ends 1st capture at end-of-line
.*? # (ignored)
Nucleotide \s+ Sequence: \s+ (.*?) # 2nd capture group...
(?: GeneID | \z) # terminated by 'GeneID' or
# end-of-string
/gmsx)
{
my ($name, $seq) = ($1, $2);
chomp($name, $seq);
print "\n$name\n\n$seq\n";
}
__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa
GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc
GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
Output:
18:24 >perl 675_SoPW.pl
cadherin 4, type 1, R-cadherin (retinal)
atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa
tetraspanin 32
atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc
tumor suppressing subtransferable candidate 4
atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
18:24 >
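Note the while loop used in conjunction with the /g modifier on the regex: in scalar context, each iteration resumes matching where the previous match ended, so the loop visits every record in turn. See “Global matching” in Using regular expressions in Perl.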
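To see the while-plus-/g idiom in isolation, here is a minimal sketch. The sample string and the collect_ids helper are made up for illustration and are not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical sample text, not taken from the gene data above.
my $text = "id=1; id=22; id=333";

# Collect every captured id by evaluating m//g repeatedly in scalar context.
sub collect_ids {
    my ($str) = @_;
    my @ids;
    # Each pass resumes where the previous match ended; the loop stops
    # as soon as no further match is found.
    while ($str =~ /id=(\d+)/g) {
        push @ids, $1;
    }
    return @ids;
}

print join(',', collect_ids($text)), "\n";
```

In list context the same regex would return all captures at once, but the loop form is handy when each match needs several capture groups handled together, as with $name and $seq above.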
Hope that helps,