in reply to Re: Extracting multiple rows in a text file with a regex.
in thread Extracting multiple rows in a text file with a regex.

Hi.

I modified the code to the following:

use strict;
use warnings;
use autodie;

open FH, '<', 'test.dat';
my $data = do {local $/; <FH>};
close FH;

open FH2, '>', 'seq.dat';

if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms)
{
    print "Yes!\n";
    print FH2 $1, $2, "\n\n";
}

close FH2;


Unfortunately, the output file now contains all of the fields listed in my original post except for the first line in which the Name is omitted.

Re^3: Extracting multiple rows in a text file with a regex.
by Athanasius (Archbishop) on Jul 28, 2013 at 08:30 UTC

    The regex needs to specify where each of the two capture groups ends:

    use strict;
    use warnings;
    use autodie;

    $_ = do { local $/; <DATA> };

    while (/ Name: \s+ (.*?) $          # $ ends 1st capture at end-of-line
             .*?                        # (ignored)
             Nucleotide \s+ Sequence: \s+
             (.*?)                      # 2nd capture group...
             (?: GeneID | \z)           # terminated by 'GeneID' or
                                        # end-of-string
           /gmsx)
    {
        my ($name, $seq) = ($1, $2);
        chomp($name, $seq);
        print "\n$name\n\n$seq\n";
    }

    __DATA__
    GeneID: 1002
    Name: cadherin 4, type 1, R-cadherin (retinal)
    Chromo: 20
    Cytoband: 20q13.3
    Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa
    GeneID: 10077
    Name: tetraspanin 32
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc
    GeneID: 10078
    Name: tumor suppressing subtransferable candidate 4
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

    Output:

    18:24 >perl 675_SoPW.pl

    cadherin 4, type 1, R-cadherin (retinal)

    atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

    tetraspanin 32

    atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

    tumor suppressing subtransferable candidate 4

    atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

    18:24 >
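    To see why the capture groups need an explicit end, it may help to compare a greedy and a non-greedy capture on a small input. This is only a sketch: the two-record sample and the helper names greedy_name and lazy_name are made up for illustration, and the greedy pattern is a simplified stand-in for the one in the question.

```perl
use strict;
use warnings;

# Hypothetical two-record sample, shortened from the data in this thread.
my $sample = <<'END';
GeneID: 1002
Name: cadherin 4
Nucleotide Sequence: atgacc
GeneID: 10077
Name: tetraspanin 32
Nucleotide Sequence: atgggg
END

# Greedy: under /s, .* may cross newlines, so it runs to the end of the
# string and backtracks only as far as the LAST "Nucleotide Sequence:".
sub greedy_name {
    my ($text) = @_;
    return $text =~ /Name:\s(.*)Nucleotide\sSequence:/s ? $1 : undef;
}

# Non-greedy and anchored: .*? stops at the first position where $
# (end of line, because of /m) can match.
sub lazy_name {
    my ($text) = @_;
    return $text =~ /Name:\s(.*?)$/ms ? $1 : undef;
}

my $greedy   = greedy_name($sample);
my $newlines = ($greedy =~ tr/\n//);    # tr/// counts the newlines captured

print "greedy capture spans $newlines newlines\n";
print "lazy capture: ", lazy_name($sample), "\n";
```

    On this sample the greedy capture swallows the first record's remaining fields and most of the second record, while the non-greedy one stops at the end of the Name line.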

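    Note the while loop used in conjunction with the /g modifier on the regex: in scalar context, each pass through the loop resumes matching where the previous match ended, so every record is visited in turn. See “Global matching” in Using regular expressions in Perl (perlretut).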
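    A minimal sketch of how /g behaves in the different contexts, using a made-up three-line input; the sub names here are hypothetical and not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical sample input; the field name follows the thread's data.
my $text = "Name: alpha\nName: beta\nName: gamma\n";

# Scalar context: each evaluation of the /g match resumes where the
# previous one stopped, so the loop visits every record in turn.
sub names_by_loop {
    my ($input) = @_;
    my @names;
    while ($input =~ /^Name:\s+(.*?)$/mg) {
        push @names, $1;
    }
    return @names;
}

# List context: the same /g match returns all captures at once.
sub names_by_list {
    my ($input) = @_;
    my @names = $input =~ /^Name:\s+(.*?)$/mg;
    return @names;
}

# Without /g the match always starts from the beginning of the string,
# so only the first record is ever seen.
sub first_name_only {
    my ($input) = @_;
    my ($first) = $input =~ /^Name:\s+(.*?)$/m;
    return $first;
}

print join(', ', names_by_loop($text)), "\n";
print join(', ', names_by_list($text)), "\n";
print first_name_only($text), "\n";
```

    The first two lines printed should list all three names, and the last only the first. The while form is the more convenient one when each match needs several captures handled together, as with $1 and $2 in the script above.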

    Hope that helps,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,