Murcia has asked for the wisdom of the Perl Monks concerning the following question:

Hi confreres,

I want to extract regions from a file.

>lmo0024        16802072 upstream sequence, from -300 to +3, size 304
ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
atga
>lmo0025        16802073 upstream sequence, from -27 to +3, size 31
aatataaaaattggaggaatagacaaaatgg
.
.
.
The regions are between the >lmo numbers (inclusive)
I know that
while(<>){ if(/begin/ ... /end/){ do } }
extracts regions
but how do I do it in this case? Thanks for any tips! Murcia

Replies are listed 'Best First'.
Re: extract regions
by Roger (Parson) on Feb 09, 2004 at 10:43 UTC
    If all you want to do is to extract data between each >lmo, you could do this instead:
    local $/ = '>lmo';
    while (<DATA>) {
        s/\n*?>lmo//;
        # add s/\n//g; if you want to combine lines
        next if !$_;
        print "--- REGION ---\n", $_, "\n";
    }
    __DATA__
    >lmo0024 16802072 upstream sequence, from -300 to +3, size 304
    ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
    caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
    atga
    >lmo0025 16802073 upstream sequence, from -27 to +3, size 31
    aatataaaaattggaggaatagacaaaatgg
    And the output -
    --- REGION ---
    0024 16802072 upstream sequence, from -300 to +3, size 304
    ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
    caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
    atga
    --- REGION ---
    0025 16802073 upstream sequence, from -27 to +3, size 31
    aatataaaaattggaggaatagacaaaatgg
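    The output above loses the ">lmo" prefix of each header, because that string is used as the record separator. A minimal sketch of a variant that keeps the header intact, assuming every header line starts with ">" and no sequence line does; read_records() is a hypothetical helper name, and the sample data is copied from the question:

```perl
use strict;
use warnings;

# read_records() is a hypothetical helper name, not part of the original reply.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n>";              # a record ends where the next header line begins
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;                # removes a trailing "\n>" (the current $/), if present
        $rec =~ s/^>//;            # only the very first record still has its leading ">"
        $rec =~ s/\n+\z//;         # the last record ends in a plain newline instead
        next unless length $rec;
        push @records, ">$rec\n";  # put the ">" back so each header stays intact
    }
    return @records;
}

# Sample input taken from the question.
my $sample = <<'END';
>lmo0024 16802072 upstream sequence, from -300 to +3, size 304
ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
atga
>lmo0025 16802073 upstream sequence, from -27 to +3, size 31
aatataaaaattggaggaatagacaaaatgg
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @records = read_records($fh);
close $fh;

print "--- REGION ---\n", $_ for @records;
```

    Splitting on "\n>" rather than on '>lmo' means the separator never eats part of the identifier, so each record can be printed or matched against its full ">lmoNNNN" header.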
Re: extract regions
by ysth (Canon) on Feb 09, 2004 at 10:46 UTC
    We had basically this same question in the thread "parsing question". See some of the answers there (particularly the reference to bioperl).
Re: extract regions
by Abigail-II (Bishop) on Feb 09, 2004 at 10:39 UTC
    It's not clear what you want. A construct like:
    while (<>) { if (/begin/ ... /end/) { ... } }
    still reads the entire file, line by line. If your file consists of 'records' that start with ^>lmo\d+ and end at the next occurrence, playing with the range operator doesn't make much sense; you might as well do a straight while (<>) { ... } loop.

    What is it that you really want?

    Abigail
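    One way the "straight loop" idea could look is sketched below. It is only an illustration under the assumption that every record starts with a ">lmo" header followed by digits; parse_records() is a hypothetical helper name, and the sample data is copied from the question:

```perl
use strict;
use warnings;

# parse_records() is a hypothetical helper: it walks the input line by line
# and starts a new record whenever a ">lmo<digits>" header line appears.
sub parse_records {
    my ($fh) = @_;
    my (@order, %seq_for, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(lmo\d+)/) {
            $id = $1;                    # remember which record we are in
            push @order, $id;            # keep the order of appearance
            $seq_for{$id} = '';
        }
        elsif (defined $id) {
            $seq_for{$id} .= $line;      # join the sequence lines of one record
        }
    }
    return (\@order, \%seq_for);
}

# Sample input taken from the question.
my $sample = <<'END';
>lmo0024 16802072 upstream sequence, from -300 to +3, size 304
ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
atga
>lmo0025 16802073 upstream sequence, from -27 to +3, size 31
aatataaaaattggaggaatagacaaaatgg
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my ($order, $seq_for) = parse_records($fh);
close $fh;

printf "%s: %d bases\n", $_, length($seq_for->{$_}) for @$order;
```

    Keeping the records in a hash keyed by identifier makes it easy to pull out any single region afterwards, at the cost of holding the whole file in memory.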

Re: extract regions
by flounder99 (Friar) on Feb 09, 2004 at 16:01 UTC
    Let's say you want from >lmo0025 to >lmo0026 inclusive. If you know that the next record will be >lmo0027 you can do something like this:
    use strict;
    while (<>) {
        if (/^>lmo0025/ .. /^>lmo0027/) {
            last if /^>lmo0027/;
            print;
        }
    }
    If you don't know that the next record is >lmo0027 you can do something like this:
    use strict;
    my ($firstfoundflag, $lastfoundflag);
    while (<>) {
        if ($firstfoundflag || /^>lmo0025/) {
            $firstfoundflag++;
            if ($lastfoundflag || /^>lmo0026/ ) {
                $lastfoundflag++;
                if (/^>lmo(?!0026)/) {
                    last;
                }
            }
            print;
        }
    }

    --

    flounder
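    The flag-based idea can be generalized so the identifiers are not hard-coded. The following is a rough sketch, not flounder99's code: extract_range() is a hypothetical helper, it assumes headers look like ">lmo" plus digits, and it changes state only when it sees a header line. The sample data is copied from the question.

```perl
use strict;
use warnings;

# extract_range() is a hypothetical helper. It returns every line from the
# record whose header id equals $first through the end of the record whose
# header id equals $last (both inclusive).
sub extract_range {
    my ($fh, $first, $last) = @_;
    my ($in_range, $seen_last, @lines) = (0, 0);
    while (my $line = <$fh>) {
        if ($line =~ /^>(lmo\d+)/) {      # state changes only on header lines
            last if $seen_last;           # first header after the last wanted record
            $in_range  = 1 if $1 eq $first;
            $seen_last = 1 if $in_range && $1 eq $last;
        }
        push @lines, $line if $in_range;
    }
    return @lines;
}

# Sample input taken from the question.
my $sample = <<'END';
>lmo0024 16802072 upstream sequence, from -300 to +3, size 304
ctttcggacaaagcgtggttgattttattcttaacgaaattccagaatggctaatgggtg
caaaaatccagtagctgcaggagcaaatggtgcagcgctagttggggaggaga
atga
>lmo0025 16802073 upstream sequence, from -27 to +3, size 31
aatataaaaattggaggaatagacaaaatgg
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
# $first and $last may be the same id to extract a single record.
my @region = extract_range($fh, 'lmo0024', 'lmo0024');
close $fh;

print @region;
```

    Because the loop stops at the first header that follows the last wanted record, there is no need to know the name of the next record in advance.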