Re^2: Extracting multiple rows in a text file with a regex.

Hi.

I modified the code to the following:

use strict;
use warnings;
use autodie;

open FH, '<', 'test.dat';
    my $data = do {local $/; <FH>};
close FH;

open FH2, '>', 'seq.dat';

if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms)
{
    print "Yes!\n";
    print FH2 $1, $2, "\n\n";
}

close FH2;

Unfortunately, the output file now contains all of the fields listed in my original post, except that the Name is omitted from the first line.


Re^3: Extracting multiple rows in a text file with a regex.
by Athanasius (Archbishop) on Jul 28, 2013 at 08:30 UTC

The regex needs to specify where each of the two capture groups ends:

use strict;
use warnings;
use autodie;

$_ = do { local $/; <DATA> };

while (/
         Name: \s+ (.*?) $                  # $ ends 1st capture at end-of-line
         .*?                                # (ignored)
         Nucleotide \s+ Sequence: \s+ (.*?) # 2nd capture group...
         (?: GeneID | \z)                   #   terminated by 'GeneID' or
                                            #   end-of-string
       /gmsx)
{
    my ($name, $seq) = ($1, $2);
    chomp($name, $seq);
    print "\n$name\n\n$seq\n";
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Output:

18:24 >perl 675_SoPW.pl

cadherin 4, type 1, R-cadherin (retinal)

atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


tetraspanin 32

atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


tumor suppressing subtransferable candidate 4

atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

18:24 >
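
To see why the end of each capture has to be pinned down, here is a minimal sketch on a shortened, made-up record (the sample string and the names $greedy and $lazy are mine, for illustration only). Under /s a greedy (.*) runs on to the end of the string, whereas (.*?) followed by $ under /m stops at the end of the line:

```perl
use strict;
use warnings;

# Shortened, made-up record for illustration only.
my $record = "Name: tetraspanin 32\nChromo: 11\nNucleotide Sequence: atgg\n";

# Greedy: under /s the dot also matches newlines, so (.*) takes
# everything up to the end of the string.
my ($greedy) = $record =~ /Name: \s+ (.*)/xs;

# Non-greedy and anchored: (.*?) grows one character at a time until
# $ (under /m) can match at the end of the current line.
my ($lazy) = $record =~ /Name: \s+ (.*?) $/xms;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture contains the Chromo and Nucleotide Sequence lines as well, which is roughly what happened with (.*)? in your version; the lazy capture holds only the name.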

Note the while loop used in conjunction with the /g modifier on the regex; see "Global matching" under "Using regular expressions in Perl" in perlretut.
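
As a small sketch of that idiom on its own (the input string and @names are made up for illustration), each pass through the loop picks up the search where the previous match left off:

```perl
use strict;
use warnings;

# Made-up input for illustration only.
my $text = "Name: alpha\nName: beta\nName: gamma\n";

my @names;
# In scalar context, each successful /g match resumes where the
# previous one ended; the loop stops when no further match is found.
while ($text =~ /^Name: \s+ (.*?) $/gmx) {
    push @names, $1;
}

print join(', ', @names), "\n";   # alpha, beta, gamma
```

Without /g the match would restart from the beginning of the string every time, so the loop would find the first name over and over and never terminate.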

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
