Dear All,

I am trying to parse a file in a while loop and print some parameters matched by regular expressions.

Below are my code and data file.

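use strict;
use warnings;

my $filename = "test.summary";
my ($gene_length, $definition, $accession, $gi_number, $gene_id);

open(my $in, "<", $filename) or die "Check the summary file. $!\n";
while (my $line = <$in>) {
    chomp $line;
    # Sequence length from the LOCUS line
    if ($line =~ /^LOCUS\s+\w+\d+\s+(\d+)\sbp/) {
        $gene_length = $1;
    }
    # Captures only the first line of a multiline DEFINITION
    if ($line =~ /^DEFINITION\s+(.*)/s) {
        $definition = $1;
    }
    # First accession on the line
    if ($line =~ /^ACCESSION\s+(\S+)/) {
        $accession = $1;
    }
    if ($line =~ /\s+\/db_xref="GI\:(\d+)\"/) {
        $gi_number = $1;
    }
    if ($line =~ /\s+\/db_xref=\"GeneID\:(\d+)\"/) {
        $gene_id = $1;
    }
}
close $in;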

 
Data file:
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

My questions:

a) How can I parse the multiline DEFINITION inside the while loop? The regular expression captures only the first line.

b) Could I get some help capturing the content of the CDS block and then parsing its individual entries one by one (GI, GeneID, etc.)?

I am trying to learn plain Perl, so I am not using the BioPerl modules for this.

Regards
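
One possible way to handle both questions is to remember which field the loop is currently inside. The sketch below is only an illustration under assumptions: it reads an abridged copy of the record from an in-memory string instead of test.summary, it treats any line with a keyword in column 1 as the start of a new field, and it assumes feature keys such as CDS are indented by exactly five spaces. The names $field, $in_cds and %xref are made up for this example.

```perl
use strict;
use warnings;

# Sample record, abridged from the data file in the post.
my $data = <<'END';
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660
     CDS             269..2614
                     /gene="CTNNB1"
                     /db_xref="GI:148233338"
                     /db_xref="GeneID:1499"
//
END

# Open a filehandle on the string so the loop looks like the file-based one.
open my $in, '<', \$data or die "Cannot open in-memory data: $!\n";

my ($definition, %xref);
my $field  = '';   # top-level keyword currently being read (hypothetical name)
my $in_cds = 0;    # true while inside the CDS feature block (hypothetical name)

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^(\w+)\s+(.*)/) {            # keyword in column 1 starts a new field
        ($field, $in_cds) = ($1, 0);
        $definition = $2 if $field eq 'DEFINITION';
    }
    elsif ($line =~ /^\s{5}(\S+)\s+\S/) {      # feature key, assumed indented 5 spaces
        ($field, $in_cds) = ('', $1 eq 'CDS');
    }
    elsif ($field eq 'DEFINITION' && $line =~ /^\s+(\S.*)/) {
        $definition .= " $1";                  # continuation line: append to the field
    }
    elsif ($in_cds && $line =~ m{^\s+/db_xref="([^:"]+):([^"]+)"}) {
        $xref{$1} = $2;                        # database name => identifier
    }
}
close $in;

print "DEFINITION: $definition\n";
print "$_: $xref{$_}\n" for sort keys %xref;
```

Because every db_xref inside the CDS block lands in %xref, entries such as CCDS or MIM come along without extra regular expressions. A record with more than one CDS feature would overwrite earlier values, so this sketch would need a list of hashes in that case.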

In reply to multiline in while loop and regular expression by newtoperlprog
