Re^3: Extracting multiple rows in a text file with a regex.

The regex needs to specify where each of the two capture groups ends:

use strict;
use warnings;
use autodie;

$_ = do { local $/; <DATA> };    # slurp the whole DATA section into $_

while (/
         Name: \s+ (.*?) $                  # $ ends 1st capture at end-of-line
         .*?                                # (ignored)
         Nucleotide \s+ Sequence: \s+ (.*?) # 2nd capture group...
         (?: GeneID | \z)                   #   terminated by 'GeneID' or
                                            #   end-of-string
       /gmsx)
{
    my ($name, $seq) = ($1, $2);
    chomp($name, $seq);    # strip one trailing newline from each, if present
    print "\n$name\n\n$seq\n";
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Output:

18:24 >perl 675_SoPW.pl

cadherin 4, type 1, R-cadherin (retinal)

atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


tetraspanin 32

atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


tumor suppressing subtransferable candidate 4

atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

18:24 >

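Note the while loop used in conjunction with the /g modifier on the regex: in scalar context, each evaluation of the match resumes where the previous successful match ended, so the loop visits every record in turn. See “Global matching” in Using regular expressions in Perl.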
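
To see that idiom in isolation, here is a minimal sketch (not part of the script above; the pairs_in helper and its sample input are made up purely for illustration). It steps through a string of key=value pairs, one match per loop iteration:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: collect every key=value
# pair in a string by stepping through it with a /g match.
sub pairs_in {
    my ($text) = @_;
    my @pairs;

    # In scalar context, a /g match starts searching at the position
    # where the previous successful match on $text ended (see pos()),
    # and returns false once no further match is found.
    while ($text =~ / (\w+) = (\w+) /gx) {
        push @pairs, [$1, $2];
    }

    return @pairs;
}

my @pairs = pairs_in('a=1 b=2 c=3');
print "$_->[0] => $_->[1]\n" for @pairs;
```

In list context the same match would instead return all the captured substrings at once; the while form is handy when, as above, you want to work with $1 and $2 one record at a time.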

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
