Re: Extracting multiple rows in a text file with a regex.

Re^2: Extracting multiple rows in a text file with a regex.
by Anonymous Monk on Jul 28, 2013 at 07:56 UTC

GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Re^3: Extracting multiple rows in a text file with a regex.

by Loops (Curate) on Jul 28, 2013 at 08:24 UTC

Hey there. A common trick is to set Perl's $/ variable (the input record separator), which lets you read input record by record instead of line by line. When it is set to "" (paragraph mode), the input is read as blocks separated by one or more empty lines. So, with the input file you provided:

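use strict;
use warnings;
use autodie;    # open and close die on failure, so no explicit checks are needed

open my $FH, '<', 'test.dat';
open my $FH2, '>', 'seq.dat';

# Paragraph mode ($/ = ""): each list element is one blank-line-separated record
for (do {local $/ = ""; <$FH>}) {
    my ($name) = /Name:\s+(.+)/;                   # rest of the Name: line
    my ($seq) = /Nucleotide\sSequence:\s(.*)/xms;  # everything up to the end of the record
    $seq =~ s/\n//g;                               # join the wrapped sequence lines
    print $FH2 "$name\n$seq\n\n";
}

close $FH;
close $FH2;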

This writes the following to seq.dat:

cadherin 4, type 1, R-cadherin (retinal)
atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggcacagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacagcagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

tetraspanin 32
atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

tumor suppressing subtransferable candidate 4
atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
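
To see paragraph mode in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration, and it reads from an in-memory filehandle so that it runs without test.dat. Unlike setting $/ to "\n\n", the empty string treats a run of several blank lines as a single separator.

```perl
use strict;
use warnings;

# Hypothetical sample data: three records separated by one or more blank lines.
my $text = "a: 1\nb: 2\n\nc: 3\n\n\n\nd: 4\n";

# Open an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @records = do {
    local $/ = "";    # paragraph mode: a run of blank lines ends a record
    <$fh>;            # list context: read every record at once
};
close $fh;

printf "%d records\n", scalar @records;
for my $record (@records) {
    $record =~ s/\n+\z//;     # drop the newlines that terminate the block
    $record =~ s/\n/ | /g;    # show the remaining lines of the record on one line
    print "[$record]\n";
}
```

Because $/ is changed with local inside the do block, the default line-by-line behaviour is restored as soon as the block ends.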

Re^2: Extracting multiple rows in a text file with a regex.
by Anonymous Monk on Jul 28, 2013 at 07:47 UTC

use strict;
use warnings;
use autodie;

open FH, '<', 'test.dat';
    my $data = do {local $/; <FH>};
close FH;

open FH2, '>', 'seq.dat';

if ($data =~ /Name:\s(.*)?.*Nucleotide\sSequence:\s(.*)?/xms)
{
    print "Yes!\n";
    print FH2 $1, $2, "\n\n";
}

close FH2;

Re^3: Extracting multiple rows in a text file with a regex.

by Athanasius (Archbishop) on Jul 28, 2013 at 08:30 UTC

The regex needs to specify where each of the two capture groups ends:

use strict;
use warnings;
use autodie;

$_ = do { local $/; <DATA> };

while (/
         Name: \s+ (.*?) $                  # $ ends 1st capture at end-of-line
         .*?                                # (ignored)
         Nucleotide \s+ Sequence: \s+ (.*?) # 2nd capture group...
         (?: GeneID | \z)                   #   terminated by 'GeneID' or
                                            #   end-of-string
       /gmsx)
{
    my ($name, $seq) = ($1, $2);
    chomp($name, $seq);
    print "\n$name\n\n$seq\n";
}

__DATA__
GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Output:

18:24 >perl 675_SoPW.pl

cadherin 4, type 1, R-cadherin (retinal)

atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa


tetraspanin 32

atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc


tumor suppressing subtransferable candidate 4

atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

18:24 >
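
To illustrate why the capture groups need explicit end points, here is a small sketch that compares greedy captures with non-greedy, terminated ones. The two-record sample and the variable names are made up for illustration. The terminators used here (a newline for the name, a blank line or end-of-string for the sequence) are a simplified variation and not the exact pattern above.

```perl
use strict;
use warnings;

# Hypothetical two-record sample in the same layout as the data above.
my $data = "Name: alpha\nNucleotide Sequence: aaa\nccc\n\n"
         . "Name: beta\nNucleotide Sequence: ggg\n";

# Greedy: under /s, (.*) first runs to the end of the string and then
# backtracks only as far as the LAST "Nucleotide Sequence:".
my ($g1, $g2) = $data =~ /Name:\s(.*).*Nucleotide\sSequence:\s(.*)/s;

# Non-greedy with terminators: (.*?) stops at the first point where the
# rest of the pattern can match.
my ($l1, $l2) = $data =~ /Name:\s+(.*?)\n.*?Nucleotide\sSequence:\s+(.*?)(?:\n\n|\z)/s;

printf "greedy name capture spans %d characters\n", length $g1;
print  "non-greedy name capture is '$l1'\n";
```

With the greedy version the first capture swallows most of the file, so only one match is possible, which is the symptom in the original attempt.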

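Note the while loop used in conjunction with the /g modifier on the regex: in scalar context, each match with /g picks up where the previous one left off, so the loop steps through every record in turn. See “Global matching” in Using regular expressions in Perl.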
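
As a minimal sketch of how while and /g work together, assuming a made-up three-line sample (the string and variable names are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical sample: three GeneID lines.
my $text = "GeneID: 1002\nGeneID: 10077\nGeneID: 10078\n";

my @ids;
# Scalar context: each successful match with /g resumes at pos($text),
# and the loop ends at the first failed match.
while ($text =~ /GeneID:\s+(\d+)/g) {
    push @ids, $1;
    printf "found %s, next search starts at offset %d\n", $1, pos $text;
}

# List context: the same regex returns every capture in one go.
my @all = $text =~ /GeneID:\s+(\d+)/g;
print "@all\n";
```

The while form is the better fit when each match needs its own processing, as in the script above. The list form is handy when only the captured values are wanted.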

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,
