Re^2: Extracting multiple rows in a text file with a regex.

Hi Loops.

The input file is formatted in the following manner:

GeneID: 1002
Name: cadherin 4, type 1, R-cadherin (retinal)
Chromo: 20
Cytoband: 20q13.3
Nucleotide Sequence: atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggc
acagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacag
cagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa

Replies are listed 'Best First'.

Re^3: Extracting multiple rows in a text file with a regex.
by Loops (Curate) on Jul 28, 2013 at 08:24 UTC

Hey there. A common trick is to use Perl's $/ variable (the input record separator). It lets you read input record by record instead of line by line. When it is set to "" (the empty string), Perl reads in paragraph mode: each read returns one block of text, with blocks delimited by one or more empty lines. So, with the input file you provided:

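use strict;
use warnings;
use autodie;    # open/close die on failure, so no explicit error checks

open my $FH, '<', 'test.dat';
open my $FH2, '>', 'seq.dat';

# Paragraph mode: with $/ set to "", each read returns one record
for (do {local $/ = ""; <$FH>}) {
    my ($name) = /Name:\s+(.+)/;
    # /s lets .* cross newlines, so the whole wrapped sequence is captured
    my ($seq) = /Nucleotide\sSequence:\s(.*)/xms;
    $seq =~ s/\n//g;    # join the wrapped lines into one
    print $FH2 "$name\n$seq\n\n";
}
close $FH2;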

Produces:

cadherin 4, type 1, R-cadherin (retinal)
atgaccgcgggcgccggcgtgctccttctgctgctctcgctctccggcacagcgagactggagatatcgtcacagtggcggctggcctggaccgagagaaagttcagcagtacacagcagcttgcgcatcctgtacctggaggccgggatgtatgacgtccccatcatcgtcacagactctggaaa

tetraspanin 32
atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

tumor suppressing subtransferable candidate 4
atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
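
If the file is large, the same paragraph-mode idea can be used one record at a time instead of slurping every record into a list first. The following is only a sketch under a few assumptions: parse_records is a hypothetical helper name (it is not part of the code above), it reads from an in-memory string so that it is easy to try out, and it skips any block that lacks a Name or Nucleotide Sequence field.

```perl
use strict;
use warnings;

# Hypothetical helper (an illustration, not code from the thread):
# parse blank-line-separated records and return [name, sequence] pairs.
sub parse_records {
    my ($text) = @_;

    # Open a read handle on the string itself (in-memory file).
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    local $/ = "";    # paragraph mode: each read returns one record
    my @records;
    while (my $block = <$fh>) {
        # /m anchors ^ and $ at line boundaries inside the record.
        my ($name) = $block =~ /^Name:\s+(.+)$/m;
        # /s lets .* cross newlines and capture the wrapped sequence.
        my ($seq) = $block =~ /^Nucleotide\sSequence:\s+(.*)/ms;
        next unless defined $name && defined $seq;    # assumption: skip incomplete blocks
        $seq =~ s/\s+//g;    # join wrapped lines, drop trailing newlines
        push @records, [$name, $seq];
    }
    close $fh;
    return @records;
}

# Two records taken from the sample input in this thread.
my $sample = <<'END';
GeneID: 10077
Name: tetraspanin 32
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggggccttggagtcgagtcagggttgccaaatgccagatgctggtc

GeneID: 10078
Name: tumor suppressing subtransferable candidate 4
Chromo: 11
Cytoband: 11p15.5
Nucleotide Sequence: atggctgaggcaggaacaggtgagccgtcccccagcgtggagggcgaa
END

print "$_->[0]\n$_->[1]\n\n" for parse_records($sample);
```

To run it against a real file, open the file as usual and put the same while loop around that handle; only one record is held in memory at a time.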