in reply to Extracting multiple rows in a text file with a regex.

You're processing the data line by line, so the regex will never see enough of the text to match.

It would help a lot if you could show 2 or 3 input records at least, so we can see how they're delimited.
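To illustrate the point about line-by-line reading, here is a minimal sketch. The two-line record and the sub names `count_line_matches` and `count_slurp_matches` are made up for the example and are not from the original post; the pattern simply needs to span more than one line.

```perl
use strict;
use warnings;

# Hypothetical two-line record, modelled on the data format in this thread.
my $text = "Name: tetraspanin 32\nNucleotide Sequence: atgggg\n";

# A pattern that spans both lines of the record.
my $re = qr/Name:\s+(.+?)\n.*?Sequence:\s+(\w+)/s;

# Line by line: each match attempt sees only a single line.
sub count_line_matches {
    my ($input) = @_;
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
    my $hits = 0;
    while (my $line = <$fh>) {
        $hits++ if $line =~ $re;
    }
    close $fh;
    return $hits;
}

# Whole text at once: the pattern can reach across the newline.
sub count_slurp_matches {
    my ($input) = @_;
    my $hits = 0;
    $hits++ while $input =~ /$re/g;
    return $hits;
}

printf "line by line: %d, slurped: %d\n",
    count_line_matches($text), count_slurp_matches($text);
```

Read line by line, neither line contains both `Name:` and `Sequence:`, so the pattern never matches; applied to the whole string it matches once.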

Re^2: Extracting multiple rows in a text file with a regex.
by Anonymous Monk on Jul 28, 2013 at 07:56 UTC
    Hi Loops.

    The input file is formatted in the following manner:

    GeneID: 1002
    Name: cadherin 4, type 1, R-cadherin (retinal)
    Chromo: 20
    Cytoband: 20q13.3
    Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

    GeneID: 10077
    Name: tetraspanin 32
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

    GeneID: 10078
    Name: tumor suppressing subtransferable candidate 4
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

      Hey there. A common trick is to set Perl's $/ variable (the input record separator), which lets you read the input record by record instead of line by line. When $/ is set to "" (paragraph mode), the input is read in blocks separated by empty lines. So, with the input file you provided:

      use strict;
      use warnings;
      use autodie;

      open my $FH,  '<', 'test.dat';
      open my $FH2, '>', 'seq.dat';

      # Paragraph mode: each read returns one blank-line-delimited record.
      for (do { local $/ = ""; <$FH> }) {
          my ($name) = /Name:\s+(.+)/;
          my ($seq)  = /Nucleotide\sSequence:\s(.*)/xms;
          $seq =~ s/\n//g;    # join the wrapped sequence lines
          print $FH2 "$name\n$seq\n\n";
      }
      Produces:
      cadherin 4, type 1, R-cadherin (retinal)
      atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggcacagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacagcagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

      tetraspanin 32
      atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

      tumor suppressing subtransferable candidate 4
      atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
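For readers who want to try paragraph mode without creating files, here is a small sketch that reads from an in-memory string instead. The sample data and the sub name `read_records` are assumptions made for illustration. Note that paragraph mode relies on blank lines between records; without them, the whole input comes back as a single record.

```perl
use strict;
use warnings;

# Hypothetical sample: two records separated by a blank line.
my $sample = "GeneID: 1\nName: alpha\n\nGeneID: 2\nName: beta\n";

sub read_records {
    my ($input) = @_;
    open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
    local $/ = "";    # paragraph mode: records end at blank lines
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @records = read_records($sample);
printf "%d records\n", scalar @records;

for my $record (@records) {
    # /m lets ^ and $ match at line boundaries inside the record.
    my ($name) = $record =~ /^Name:\s+(.+)$/m;
    print "$name\n";
}
```

Using `local` confines the change to `$/` to the enclosing sub, so other reads in the program keep their normal line-by-line behaviour.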
Re^2: Extracting multiple rows in a text file with a regex.
by Anonymous Monk on Jul 28, 2013 at 07:47 UTC
    Hi.

    I modified the code to the following:

    use strict;
    use warnings;
    use autodie;

    open FH, '<', 'test.dat';
    my $data = do { local $/; <FH> };
    close FH;

    open FH2, '>', 'seq.dat';
    if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms) {
        print "Yes!\n";
        print FH2 $1, $2, "\n\n";
    }
    close FH2;


    Unfortunately, the output file now contains all of the fields listed in my original post, except that the first line is missing its "Name:" label.

      The regex needs to specify where each of the two capture groups ends:

      use strict;
      use warnings;
      use autodie;

      $_ = do { local $/; <DATA> };

      while (/ Name: \s+ (.*?) $          # $ ends 1st capture at end-of-line
               .*?                        # (ignored)
               Nucleotide \s+ Sequence: \s+
               (.*?)                      # 2nd capture group...
               (?: GeneID | \z)           # terminated by 'GeneID' or
                                          # end-of-string
             /gmsx)
      {
          my ($name, $seq) = ($1, $2);
          chomp($name, $seq);
          print "\n$name\n\n$seq\n";
      }

      __DATA__
      GeneID: 1002
      Name: cadherin 4, type 1, R-cadherin (retinal)
      Chromo: 20
      Cytoband: 20q13.3
      Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
      acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
      cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

      GeneID: 10077
      Name: tetraspanin 32
      Chromo: 11
      Cytoband: 11p15.5
      Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

      GeneID: 10078
      Name: tumor suppressing subtransferable candidate 4
      Chromo: 11
      Cytoband: 11p15.5
      Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

      Output:

      18:24 >perl 675_SoPW.pl

      cadherin 4, type 1, R-cadherin (retinal)

      atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
      acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
      cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


      tetraspanin 32

      atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


      tumor suppressing subtransferable candidate 4

      atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

      18:24 >
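To see why each capture group needs an explicit end point, the following sketch compares a greedy and a non-greedy capture under the /s modifier. The two-record string is invented for the example and uses shortened field names.

```perl
use strict;
use warnings;

# Hypothetical two-record string, loosely following the thread's field layout.
my $text = "Name: alpha\nSequence: aaa\nName: beta\nSequence: ccc\n";

# Greedy: under /s, (.*) runs to the end of the string and backtracks only
# as far as the LAST "Sequence:", so the capture swallows most of both records.
my ($greedy) = $text =~ /Name:\s(.*)Sequence:/s;

# Non-greedy and anchored: (.*?) stops at the first end-of-line ($ with /m).
my ($lazy) = $text =~ /Name:\s(.*?)$/ms;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The greedy capture holds everything from "alpha" up to the second "Sequence:" label, while the non-greedy one holds just "alpha".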

      Note the while loop used in conjunction with the /g modifier on the regex — see “Global matching” in Using regular expressions in Perl.
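As a small illustration of global matching, this sketch runs the same /g pattern in scalar context (a while loop) and in list context. The `id=` input format is an assumption chosen only to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical input; the id=N format is made up for illustration.
my $text = "id=1 id=22 id=333";

# Scalar context: each iteration resumes where the previous match ended.
my @seen;
while ($text =~ /id=(\d+)/g) {
    push @seen, $1;
}

# List context: /g returns every capture in one go.
my @all = $text =~ /id=(\d+)/g;

print "@seen\n";
print "@all\n";
```

Both forms collect the same three values. The while form is handy when each match needs its own processing, as with the name and sequence pairs above.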

      Hope that helps,

      Athanasius <°(((>< contra mundum
      Iustus alius egestas vitae, eros Piratica,