Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I am attempting to extract the Name and Nucleotide Sequence information from a file which has the following format:

GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcg
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagc
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacaga


The code I wrote to accomplish this seemingly simple task is below:

#!/usr/bin/perl -w
use strict;
use autodie;

open FH, '<', 'test.dat';
my @data = <FH>;
close FH;

open FH2, '>', 'seq.dat';
for (@data) {
    if (/Name:\s(.*?).*Nucleotide\sSequence:\s(.*?)/xms) {
        print FH2 $1, $2, "\n\n";
    }
}
close FH2;


Ideally, the output file will consist of pairs of rows in the following format (where 'blah' is the name and 'foo' is the sequence):

blah
foo

blah
foo

...

Thanks for the help.

Re: Extracting multiple rows in a text file with a regex.
by Skeeve (Parson) on Jul 28, 2013 at 09:02 UTC

    If you don't want to "slurp" in the whole file at once:

    my $name;
    my $seq;
    while (<DATA>) {
        chomp;
        if (s/^Name:\s*//) {
            $name = $_;
            next;
        }
        if (s/^Nucleotide Sequence:\s*//) {
            $seq = $_;
            # Append continuation lines until the blank line that ends the record
            while (<DATA>) {
                last if /^\s*$/;
                chomp;
                $seq .= $_;
            }
            print "$name\n$seq\n\n";
        }
    }
    __DATA__
    GeneID: 1002
    Name: cadherin 4, type 1, R-cadherin (retinal)
    Chromo: 20
    Cytoband: 20q13.3
    Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

    GeneID: 10077
    Name: tetraspanin 32
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

    GeneID: 10078
    Name: tumor suppressing subtransferable candidate 4
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Extracting multiple rows in a text file with a regex.
by 2teez (Vicar) on Jul 28, 2013 at 09:51 UTC

    If I may add this: you can step through your data line by line, picking out what you want. To collect each "Nucleotide Sequence", since a blank line is used as the delimiter, you can use Perl's "flip-flop" (range) operator (..), like so:

    use warnings;
    use strict;

    while (<DATA>) {
        if (/Name:\s+?(.+?)$/) {
            print $1, $/;
        }
        if (/Nucleotide Sequence/ .. /^\s*$/) {    # use "flip-flop" operator
            s/.*:\s+?//;    # strip the "Nucleotide Sequence: " prefix, then print
            print;
        }
    }
    __DATA__
    GeneID: 1002
    Name: cadherin 4, type 1, R-cadherin (retinal)
    Chromo: 20
    Cytoband: 20q13.3
    Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

    GeneID: 10077
    Name: tetraspanin 32
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

    GeneID: 10078
    Name: tumor suppressing subtransferable candidate 4
    Chromo: 11
    Cytoband: 11p15.5
    Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

    Produces:

    cadherin 4, type 1, R-cadherin (retinal)
    atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
    acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
    cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

    tetraspanin 32
    atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

    tumor suppressing subtransferable candidate 4
    atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

    If you tell me, I'll forget.
    If you show me, I'll remember.
    If you involve me, I'll understand.
    --- Author unknown to me

      I'd move the flip-flop test to the top of the loop. Testing for "Name" while the flip-flop is still true (that is, while you are inside a sequence block) is wasted work.

      Additionally, I'd either make the second "if" an "elsif" or leave the first "if" block with a "next", so that no line is checked more often than required.
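      A minimal sketch of that reordering, assuming records are separated by blank lines as in the sample data. The helper name extract_records and the decision to return the text rather than print it are choices made for this illustration only; they are not part of the code posted above.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): reads records from a
# filehandle and returns the "name, then sequence lines" text.
sub extract_records {
    my ($fh) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        # Flip-flop tested first: true from the "Nucleotide Sequence" line
        # through the next blank line (or until the input runs out).
        if ($line =~ /^Nucleotide Sequence:/ .. $line =~ /^\s*$/) {
            $line =~ s/^Nucleotide Sequence:\s*//;
            $out .= $line;
            next;    # inside a sequence block, so skip the Name test
        }
        if ($line =~ /^Name:\s+(.+?)\s*$/) {
            $out .= "$1\n";
        }
    }
    return $out;
}

# Small in-memory sample standing in for the real file.
my $sample = "GeneID: 1\nName: a b\nNucleotide Sequence: atg\nccc\n\n"
           . "GeneID: 2\nName: c\nNucleotide Sequence: ggg\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print extract_records($in);
```

      The flip-flop keeps its state between iterations, so if the input ends without a closing blank line it is still "on" afterwards; that only matters if the same operator is reused on a second file.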


Re: Extracting multiple rows in a text file with a regex.
by Loops (Curate) on Jul 28, 2013 at 07:34 UTC

    You're processing the data line by line, so the regex will never see enough of the text to match.

    It would help a lot if you could show 2 or 3 input records at least, so we can see how they're delimited.
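    To illustrate the point about line-by-line processing, here is a small sketch. The record, the regex, and the helper name matches_per_line are all made up for this illustration (the regex is a simplified stand-in, not the one from the original post); it only shows that a pattern spanning several lines cannot match any single line, but can match the record as a whole.

```perl
use strict;
use warnings;

# One made-up record, held in a single string.
my $record = join "\n",
    'GeneID: 1002',
    'Name: cadherin 4',
    'Nucleotide Sequence: atgacc',
    'acagcg',
    '';

# Illustrative pattern that needs both "Name:" and "Nucleotide Sequence:"
# in the same string; /s lets "." cross newlines.
my $re = qr/Name:\s(.*?)\n.*Nucleotide\sSequence:\s(.*)/s;

# Hypothetical helper: counts how many individual lines match the pattern.
sub matches_per_line {
    my ($pattern, $text) = @_;
    return scalar grep { $_ =~ $pattern } split /^/, $text;
}

printf "lines matching on their own: %d\n", matches_per_line($re, $record);

# Matching against the whole record succeeds.
if (my ($name, $seq) = $record =~ $re) {
    $seq =~ s/\n//g;    # join the sequence lines
    print "$name\n$seq\n";
}
```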

      Hi Loops.

      The input file is formatted in the following manner:

      GeneID: 1002
      Name: cadherin 4, type 1, R-cadherin (retinal)
      Chromo: 20
      Cytoband: 20q13.3
      Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
      acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
      cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

      GeneID: 10077
      Name: tetraspanin 32
      Chromo: 11
      Cytoband: 11p15.5
      Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

      GeneID: 10078
      Name: tumor suppressing subtransferable candidate 4
      Chromo: 11
      Cytoband: 11p15.5
      Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

        Hey there. A common trick is to set Perl's $/ variable (the input record separator), which lets you read input record by record instead of line by line. When it is set to "" (paragraph mode), the input is read as blocks separated by empty lines. So, with the input file you provided:

        use strict;
        use autodie;

        open my $FH,  '<', 'test.dat';
        open my $FH2, '>', 'seq.dat';

        for (do { local $/ = ""; <$FH> }) {
            my ($name) = /Name:\s+(.+)/;
            my ($seq)  = /Nucleotide\sSequence:\s(.*)/xms;
            $seq =~ s/\n//g;
            print $FH2 "$name\n$seq\n\n";
        }

        Produces:

        cadherin 4, type 1, R-cadherin (retinal)
        atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggcacagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacagcagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

        tetraspanin 32
        atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

        tumor suppressing subtransferable candidate 4
        atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

      Hi.

      I modified the code to the following:

      use strict;
      use warnings;
      use autodie;

      open FH, '<', 'test.dat';
      my $data = do { local $/; <FH> };
      close FH;

      open FH2, '>', 'seq.dat';
      if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms) {
          print "Yes!\n";
          print FH2 $1, $2, "\n\n";
      }
      close FH2;


      Unfortunately, the output file now contains all of the fields listed in my original post except for the first line in which the Name is omitted.

        The regex needs to specify where each of the two capture groups ends:

        use strict;
        use warnings;
        use autodie;

        $_ = do { local $/; <DATA> };

        while (/ Name: \s+ (.*?) $              # $ ends 1st capture at end-of-line
                 .*?                            # (ignored)
                 Nucleotide \s+ Sequence: \s+
                 (.*?)                          # 2nd capture group...
                 (?: GeneID | \z)               # terminated by 'GeneID' or
                                                # end-of-string
               /gmsx)
        {
            my ($name, $seq) = ($1, $2);
            chomp($name, $seq);
            print "\n$name\n\n$seq\n";
        }

        __DATA__
        GeneID: 1002
        Name: cadherin 4, type 1, R-cadherin (retinal)
        Chromo: 20
        Cytoband: 20q13.3
        Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
        acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
        cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

        GeneID: 10077
        Name: tetraspanin 32
        Chromo: 11
        Cytoband: 11p15.5
        Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

        GeneID: 10078
        Name: tumor suppressing subtransferable candidate 4
        Chromo: 11
        Cytoband: 11p15.5
        Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

        Output:

        18:24 >perl 675_SoPW.pl

        cadherin 4, type 1, R-cadherin (retinal)

        atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
        acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
        cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


        tetraspanin 32

        atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


        tumor suppressing subtransferable candidate 4

        atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

        18:24 >

        Note the while loop used in conjunction with the /g modifier on the regex — see “Global matching” in Using regular expressions in Perl.
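        As a stripped-down illustration of that idiom (the helper name all_names and the two-record sample string are invented for this sketch), a match with /g in scalar context resumes where the previous match ended, so a while loop visits each occurrence exactly once:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every "Name:" value found in $text.
sub all_names {
    my ($text) = @_;
    my @names;
    # Each pass through the condition finds the next match after the
    # previous one; the loop ends when no further match exists.
    while ($text =~ /^Name:\s+(.*?)\s*$/mg) {
        push @names, $1;
    }
    return @names;
}

my @names = all_names("GeneID: 1\nName: alpha\n\nGeneID: 2\nName: beta\n");
print "$_\n" for @names;
```

        Without the /g modifier the match would restart from the beginning of the string on every iteration, and the loop would never terminate.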

        Hope that helps,

        Athanasius <°(((>< contra mundum
        Iustus alius egestas vitae, eros Piratica,