Is that the entire contents of the file, or is that one record of many? If there are many records, what does a record separator look like? Maybe you need to show us two records?

Are ORGANISM and CDS really indented like that? Are there other field types you haven't told us about? Is the // actually part of the file, or did it just happen to "slip in" while you weren't looking?

Maybe the following parsing code will get you started:

use strict;
use warnings;

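my @records;
my $currTail;     # array ref: value lines collected for the current field
my $currField;    # name of the field currently being collected

# Keep looping after EOF until the last pending field has been stored.
while (defined(my $line = <DATA>) or defined $currField) {
    my $field;
    my $tail;

    # Columns 1-10 hold the (possibly indented) field name, then two spaces,
    # then the value. Lines that don't fit (such as //) leave both undefined.
    ($field, $tail) = $line =~ /^(.{10})  (.*)/ if defined $line;
    next if !defined $tail && !defined $currField;

    # A continuation line has an all-blank name, which becomes ''.
    $field =~ tr/ //d if defined $field;
    $currField //= $field;

    # A new field name, an unparseable line or EOF ends the current field.
    if (! defined $field or (length $field && $currField ne $field)) {
        push @records, {} if $currField eq 'LOCUS';    # LOCUS starts a record
        $records[-1]{$currField} = $currTail;
        $currField               = undef;
        $currTail                = undef;
        last if !defined $tail;    # stop at // or EOF
    }

    $currField = $field if length $field;
    push @$currTail, $tail if defined $tail;
}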

for my $record (@records) {
    print "$_:\n", map{"   $_\n"} @{$record->{$_}} for sort keys %$record;
}


__DATA__
LOCUS       NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
DEFINITION  Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
            (CTNNB1), transcript variant 2, mRNA.
ACCESSION   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
VERSION     NM_001098209.1  GI:148233337
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
     CDS             269..2614
                     /gene="CTNNB1"
                     /gene_synonym="armadillo; CTNNB; MRD19"
                     /codon_start=1
                     /product="catenin beta-1"
                     /protein_id="NP_001091679.1"
                     /db_xref="GI:148233338"
                     /db_xref="CCDS:CCDS2694.1"
                     /db_xref="GeneID:1499"
                     /db_xref="HGNC:HGNC:2514"
                     /db_xref="MIM:116806"
                     /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
                     SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
                     LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
//

Prints:

ACCESSION:
   NM_001098209 XM_001133660 XM_001133664 XM_001133673 XM_001133675
CDS:
            269..2614
            /gene="CTNNB1"
            /gene_synonym="armadillo; CTNNB; MRD19"
            /codon_start=1
            /product="catenin beta-1"
            /protein_id="NP_001091679.1"
            /db_xref="GI:148233338"
            /db_xref="CCDS:CCDS2694.1"
            /db_xref="GeneID:1499"
            /db_xref="HGNC:HGNC:2514"
            /db_xref="MIM:116806"
            /translation="MATQADLMELDMAMEPDRKAAVSHWQQQSYLDSGIHSGATTTAP
            SLSGKGNPEEEDVDTSQVLYEWEQGFSQSFTQEQVADIDGQYAMTRAQRVRAAMFPET
            LDEGMQIPSTQFDAAHPTNVQRLAEPSQMLKHAVVNLINYQDDAELATRAIPELTKLL
DEFINITION:
   Homo sapiens catenin (cadherin-associated protein), beta 1, 88kDa
   (CTNNB1), transcript variant 2, mRNA.
KEYWORDS:
   RefSeq.
LOCUS:
   NM_001098209            3415 bp    mRNA    linear   PRI 27-APR-2014
ORGANISM:
   Homo sapiens
   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
   Catarrhini; Hominidae; Homo.
SOURCE:
   Homo sapiens (human)
VERSION:
   NM_001098209.1  GI:148233337
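
If the // does turn out to end each record (it does in GenBank flat files), note that the loop above bails out at the first line that doesn't fit the field layout, so as written it only reads one record. One way around that, as a rough sketch only, is to set $/ so that each read returns a whole record and then split the fields inside it. parse_records is a name I made up for illustration, and the two tiny records in the demo are invented, not real data. It keeps the same layout assumption as the code above (10-column field name, two spaces, value).

```perl
use strict;
use warnings;

# Hypothetical helper: split flat-file text into records, assuming every
# record is terminated by a line containing only "//".
sub parse_records {
    my ($text) = @_;
    my @records;

    # Read from the string through an in-memory file handle.
    open my $in, '<', \$text or die "Can't read in-memory text: $!";
    local $/ = "//\n";    # one <$in> now returns one whole record

    while (defined(my $chunk = <$in>)) {
        chomp $chunk;     # chomp strips the "//\n" terminator
        next if $chunk !~ /\S/;

        my (%record, $curr);
        for my $line (split /\n/, $chunk) {
            # Skip lines that don't fit the assumed layout.
            my ($field, $tail) = $line =~ /^(.{10})  (.*)/ or next;
            $field =~ tr/ //d;
            $curr = $field if length $field;    # blank name = continuation
            next if !defined $curr;
            push @{$record{$curr}}, $tail;
        }
        push @records, \%record if %record;
    }

    return @records;
}

# Made-up two-record input, just to show the call.
my $sample = join("\n",
    'LOCUS       AAA',
    'DEFINITION  first made-up record',
    '//',
    'LOCUS       BBB',
    'DEFINITION  second made-up record',
    '//',
) . "\n";

my @parsed = parse_records($sample);
print "$_->{LOCUS}[0]\n" for @parsed;
```

Each element of the returned list is a hash of field name to array of value lines, the same shape the code above builds in @records. If a value can itself end a line with //, this simple separator trick will split in the wrong place, so check it against your real data.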

Perl is the programming world's equivalent of English

In reply to Re: multiline in while loop and regular expression by GrandFather
in thread multiline in while loop and regular expression by newtoperlprog
