in reply to multiline in while loop and regular expression
Is that the entire contents of the file, or is that one record of many? If there are many records, what does a record separator look like? Maybe you need to show us two records?
Are ORGANISM and CDS really indented like that? Are there other field types you haven't told us about? Is the // actually part of the file, or did it just happen to "slip in" while you weren't looking?
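If it turns out that every record does end with a line containing only //, one option is to let Perl split the records for you by setting the input record separator $/. This is only a rough sketch under that assumption: split_records is a name made up for the example, the sample text is invented, and it expects Unix line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: split text holding several records into one
# chunk per record. Assumes each record is terminated by a line that
# contains nothing but "//".
sub split_records {
    my ($text) = @_;

    open my $fh, '<', \$text or die "Can't open in-memory file: $!";
    local $/ = "\n//\n";            # read up to and including the terminator

    my @chunks;
    while (defined(my $chunk = <$fh>)) {
        chomp $chunk;               # chomp strips the current value of $/
        next if $chunk !~ /\S/;     # ignore an empty trailing chunk
        push @chunks, $chunk;
    }
    return @chunks;
}

# Invented two-record sample, just to exercise the helper
my $sample = "LOCUS      AAA\nKEYWORDS   first\n//\n"
           . "LOCUS      BBB\nKEYWORDS   second\n//\n";

my @chunks = split_records($sample);
print scalar(@chunks), " record(s)\n";
print "--\n$_\n" for @chunks;
```

Each chunk could then be split on newlines and fed through a field parser one record at a time.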
Maybe the following parsing code will get you started:
    use strict;
    use warnings;

    my @records;
    my $currTail;
    my $currField;

    # Keep looping after end of file until the last pending field is stored
    while (defined(my $line = <DATA>) or defined $currField) {
        my $field;
        my $tail;

        # The first 10 columns hold the (possibly indented) field name,
        # then one space, then the value. Continuation lines have a blank name.
        ($field, $tail) = $line =~ /^(.{10}) (.*)/ if defined $line;
        next if !defined $tail && !defined $currField;

        $field =~ tr/ //d if defined $field;
        $currField //= $field;

        # A new field name, a non-matching line (such as //) or end of file
        # completes the current field
        if (! defined $field or (length $field && $currField ne $field)) {
            push @records, {} if $currField eq 'LOCUS';
            $records[-1]{$currField} = $currTail;
            $currField = undef;
            $currTail = undef;
            last if !defined $tail;
        }

        $currField = $field if length $field;
        push @$currTail, $tail if defined $tail;
    }

    for my $record (@records) {
        print "$_:\n", map{" $_\n"} @{$record->{$_}} for sort keys %$record;
    }

    __DATA__
    LOCUS      NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
    DEFINITION Homo sapiens catenin (cadherin-associated protein), beta 1,
               88kDa (CTNNB1), transcript variant 2, mRNA.
    ACCESSION  NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
    VERSION    NM_001098209.1 GI:148233337
    KEYWORDS   RefSeq.
    SOURCE     Homo sapiens (human)
      ORGANISM Homo sapiens
               Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
               Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
               Catarrhini; Hominidae; Homo.
         CDS   269..2614
               /gene="CTNNB1"
               /gene_synonym="armadillo; CTNNB; MRD19"
               /codon_start=1
               /product="catenin beta-1"
               /protein_id="NP_001091679.1"
               /db_xref="GI:148233338"
               /db_xref="CCDS:CCDS2694.1"
               /db_xref="GeneID:1499"
               /db_xref="HGNC:HGNC:2514"
               /db_xref="MIM:116806"
               /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
               SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
               LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
    //
Prints:
    ACCESSION:
     NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
    CDS:
     269..2614
     /gene="CTNNB1"
     /gene_synonym="armadillo; CTNNB; MRD19"
     /codon_start=1
     /product="catenin beta-1"
     /protein_id="NP_001091679.1"
     /db_xref="GI:148233338"
     /db_xref="CCDS:CCDS2694.1"
     /db_xref="GeneID:1499"
     /db_xref="HGNC:HGNC:2514"
     /db_xref="MIM:116806"
     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
    DEFINITION:
     Homo sapiens catenin (cadherin-associated protein), beta 1,
     88kDa (CTNNB1), transcript variant 2, mRNA.
    KEYWORDS:
     RefSeq.
    LOCUS:
     NM_001098209 3415 bp mRNA linear PRI 27-APR-2014
    ORGANISM:
     Homo sapiens
     Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
     Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
     Catarrhini; Hominidae; Homo.
    SOURCE:
     Homo sapiens (human)
    VERSION:
     NM_001098209.1 GI:148233337
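Since each field ends up as an array of lines, the CDS entry can be broken down further once the basic parse works. A rough sketch, assuming the first line is the location, every qualifier starts with / and any other line continues the previous qualifier. parse_qualifiers is a name made up for the example, and gluing continuation lines together with no separator suits /translation but probably not free-text qualifiers.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the lines stored for a CDS field into a
# location plus a hash of qualifiers. Each qualifier maps to an array
# because keys such as db_xref can repeat.
sub parse_qualifiers {
    my @lines = @_;
    my $location = shift @lines;    # e.g. 269..2614
    my (%qual, $current);

    for my $line (@lines) {
        if ($line =~ m{^/(\w+)=?(.*)}) {
            push @{ $qual{$1} }, $2;        # start of a new /key=value
            $current = \$qual{$1}[-1];      # remember where to append
        }
        elsif ($current) {
            $$current .= $line;             # continuation line
        }
    }

    # Drop the surrounding double quotes, where present
    for my $values (values %qual) {
        s/^"|"$//g for @$values;
    }
    return ($location, \%qual);
}

# A few of the CDS lines from the record above
my ($location, $qual) = parse_qualifiers(
    '269..2614',
    '/gene="CTNNB1"',
    '/codon_start=1',
    '/db_xref="GI:148233338"',
    '/db_xref="GeneID:1499"',
    '/translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP',
    'SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET',
);

print "location: $location\n";
print "$_: @{ $qual->{$_} }\n" for sort keys %$qual;
```

With the parser above you would call it as parse_qualifiers(@{ $record->{CDS} }).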
Re^2: multiline in while loop and regular expression
by newtoperlprog (Sexton) on Nov 24, 2014 at 22:33 UTC
by GrandFather (Saint) on Nov 24, 2014 at 23:58 UTC