Extracting multiple rows in a text file with a regex.

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extracting multiple rows in a text file with a regex.
by Skeeve (Parson) on Jul 28, 2013 at 09:02 UTC

If you don't want to "slurp" in the whole file at once:

my $name;
my $seq;
while (<DATA>) {
    chomp;
    if (s/^Name:\s*//) {
        $name= $_;
        next;
    }
    if (s/^Nucleotide Sequence:\s*//) {
        $seq= $_;
        while (<DATA>) {
            last if /^\s*$/;
            chomp;
            $seq.= $_;
        }
        print "$name\n$seq\n\n";
    }
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Re: Extracting multiple rows in a text file with a regex.
by 2teez (Vicar) on Jul 28, 2013 at 09:51 UTC

If I may add this: you can step through your data line by line, picking out what you want. Since a blank line delimits each record, you can collect every "Nucleotide Sequence" with Perl's range operator (..), often called the "flip-flop" operator, like so:

use warnings;
use strict;


while(<DATA>){
  if(/Name:\s+?(.+?)$/){
    print $1,$/;
  }
  if(/Nucleotide Sequence/../^\s*$/){ # use "flip-flop" operator
    s/.*:\s+?//;  # strip the "Nucleotide Sequence:" label, then print
    print;
  }
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

cadherin 4, type 1, R-cadherin (retinal)
atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

tetraspanin 32
atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

tumor suppressing subtransferable candidate 4
atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

If you tell me, I'll forget.
If you show me, I'll remember.
if you involve me, I'll understand.
--- Author unknown to me

Re^2: Extracting multiple rows in a text file with a regex.

by Skeeve (Parson) on Jul 29, 2013 at 06:05 UTC

I'd move the flip-flop test to the top of the loop. Testing for "Name" while the flip-flop is still true (that is, while you are inside a sequence) is wasted work.

Additionally, I'd either make the second "if" an "elsif" or leave the first "if" block with a "next", so that no line is checked more often than required.
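
A minimal sketch of that reordering, not tested against the full data set. The sub name extract and the eof() check are my own additions: the eof() check only makes sure the flip-flop also "flops" when the last record has no trailing blank line.

```perl
use strict;
use warnings;

# Hypothetical helper: reads records from a filehandle, returns the text
# that the original loop would have printed.
sub extract {
    my ($fh) = @_;
    my $out = '';
    while (<$fh>) {
        # Flip-flop first: true from the "Nucleotide Sequence" line
        # through the next blank line (or the last line of input).
        if (/^Nucleotide Sequence:/ .. (/^\s*$/ || eof($fh))) {
            s/^Nucleotide Sequence:\s*//;   # strip the label if present
            $out .= $_;
            next;                           # inside a sequence: skip the Name test
        }
        $out .= "$1\n" if /^Name:\s*(.+)$/;
    }
    return $out;
}

my $data = <<'END';
GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print extract($fh);
```

Using "next" rather than "elsif" keeps the Name test as a plain statement modifier; either way each line is tested at most once per condition.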

Re: Extracting multiple rows in a text file with a regex.
by Loops (Curate) on Jul 28, 2013 at 07:34 UTC

You're processing the data line by line, so the regex will never see enough of the text to match.
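
To illustrate the point with a made-up three-line record (the sample text and the regex here are assumptions, not your actual code): a pattern that spans several lines cannot match any single line, but it does match once the whole record is in one string.

```perl
use strict;
use warnings;

# Hypothetical sample record and pattern, for illustration only.
my $text = "Name: tetraspanin 32\nChromo: 11\nNucleotide Sequence: atgg\n";
my $re   = qr/Name:\s+(.+?)\n.*Nucleotide Sequence:\s+(\S+)/s;

# Line by line: no single line contains both labels, so nothing matches.
my $line_hits = grep { $_ =~ $re } split /^/, $text;

# Whole record in one string: the same regex now matches.
my ($name, $seq) = $text =~ $re;

print "line-by-line matches: $line_hits\n";
print "whole-text match: $name / $seq\n";
```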

It would help a lot if you could show 2 or 3 input records at least, so we can see how they're delimited.

Re^2: Extracting multiple rows in a text file with a regex.

by Anonymous Monk on Jul 28, 2013 at 07:56 UTC

Loops

GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Re^3: Extracting multiple rows in a text file with a regex.

by Loops (Curate) on Jul 28, 2013 at 08:24 UTC

Hey there. A common trick is to set Perl's $/ variable (the input record separator), which lets you read input record by record instead of line by line. When it is set to "" (paragraph mode), the input is read in blocks separated by one or more empty lines. So, with the input file you provided:

use strict;
use autodie;

open my $FH, '<', 'test.dat';
open my $FH2, '>', 'seq.dat';

for (do {local $/ = ""; <$FH>}) {
    my ($name) = /Name:\s+(.+)/;
    my ($seq) = /Nucleotide\sSequence:\s(.*)/xms;
    $seq =~ s/\n//g;
    print $FH2 "$name\n$seq\n\n";
}

cadherin 4, type 1, R-cadherin (retinal)
atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggcacagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacagcagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

tetraspanin 32
atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

tumor suppressing subtransferable candidate 4
atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Re^2: Extracting multiple rows in a text file with a regex.

by Anonymous Monk on Jul 28, 2013 at 07:47 UTC

use strict;
use warnings;
use autodie;

open FH, '<', 'test.dat';
    my $data = do {local $/; <FH>};
close FH;

open FH2, '>', 'seq.dat';

if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms)
{
    print "Yes!\n";
    print FH2 $1, $2, "\n\n";
}

close FH2;

Re^3: Extracting multiple rows in a text file with a regex.

by Athanasius (Archbishop) on Jul 28, 2013 at 08:30 UTC

The regex needs to specify where each of the two capture groups ends:

use strict;
use warnings;
use autodie;

$_ = do { local $/; <DATA> };

while (/
         Name: \s+ (.*?) $                  # $ ends 1st capture at end-of-line
         .*?                                # (ignored)
         Nucleotide \s+ Sequence: \s+ (.*?) # 2nd capture group...
         (?: GeneID | \z)                   #   terminated by 'GeneID' or
                                            #   end-of-string
       /gmsx)
{
    my ($name, $seq) = ($1, $2);
    chomp($name, $seq);
    print "\n$name\n\n$seq\n";
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Output:

18:24 >perl 675_SoPW.pl

cadherin 4, type 1, R-cadherin (retinal)

atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


tetraspanin 32

atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


tumor suppressing subtransferable candidate 4

atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

18:24 >

Note the while loop used in conjunction with the /g modifier on the regex; see "Global matching" in "Using regular expressions in Perl".
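
As a smaller illustration of the same idea (the sample string below is made up for this sketch): in scalar context, each /g match resumes where the previous one ended, so a while loop visits every match in turn, whereas in list context a single /g match returns all the captures at once.

```perl
use strict;
use warnings;

# Hypothetical sample input, for illustration only.
my $text = "Name: alpha\nName: beta\nName: gamma\n";

# Scalar context: each pass resumes at pos($text), where the last match ended.
my @names;
while ($text =~ /^Name:\s+(.*?)$/gm) {
    push @names, $1;
}

# List context: /g returns every capture in a single call.
my @all = $text =~ /^Name:\s+(.*?)$/gm;

print "@names\n";
print "@all\n";
```

The while form is the one to reach for when each match needs its own processing step, as with the name/sequence pairs above.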

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
