pdotcdot has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have been trying to extract a multi-line sequence, which consists of a string of letters spanning several lines followed by a blank line, then a header line, i.e.
    *header
    abcdefghik
    sdsaadsd
    addds

    *header
    a....
So far I have used the following, but it still only matches 1 line at a time! I.e. when I use the array @seqs later, each $i element is only 1 line. I am out of books/ideas, so all help would be appreciated.
    while ($file=<INPUT1>){
        chomp $line;
        if($file =~/^[a-z](.*?)(?=(\z))/msg){
            $seq=$1;
            $seq=~ s/\n//;
            push @seqs, $seq;
            print OUTFILE2 @seqs;
            exit;
        }
        elsif($file =~/*/){
            $header =$fastafile;
            chomp $header;
            $header=~s/>//;
            push @headers, $header;
            print OUTFILE "$header\n";
        }
Many thanks in advance, I know this is a novice question :-) PC

Replies are listed 'Best First'.
Re: multi-line match
by broquaint (Abbot) on Aug 05, 2003 at 15:45 UTC
    If your data is delimited by a blank line, then you can change the $/ (input record separator) variable to read the file in chunks, each ending at a double newline. Something like this should get you started on parsing your file:
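        {
            local $/ = "\n\n";    # read one blank-line-delimited record at a time
            while (<INPUT1>) {
                # header line (minus the leading *), then everything up to the trailing newlines;
                # /s lets . cross line breaks inside the record
                my ($head, $data) = m< ^ \* ([^\n]+) \n (.*?) \n* \z >xs
                    or next;
                push @seqs => [ split /\n/, $data ];
                push @headers => $head;
                print OUTFILE "$head\n";
                print OUTFILE2 @{ $seqs[-1] }, "\n";
            }
        }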
    The above code should save the data into @seqs, which will be an array of arrays, and the headers into @headers as a simple array, all the while printing the data into OUTFILE2 and the headers into OUTFILE. A couple of errors in your code: you were exiting in the first branch (maybe you meant next?), and you forgot to escape * in the second condition, which should have triggered a compile-time error.
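If the array-of-arrays shape is unfamiliar, here is a minimal sketch of walking it afterwards. The sample contents of @headers and @seqs and the %seq_for hash are made up for illustration; they only mimic what the loop above would build.

```perl
use strict;
use warnings;

# Hypothetical sample data in the shape the parsing loop builds:
# one header per record, and one array ref of sequence lines per record.
my @headers = ('header one', 'header two');
my @seqs    = ( [ 'abcdefghik', 'sdsaadsd', 'addds' ], [ 'acgt' ] );

# Join each record's lines into a single string, keyed by its header.
my %seq_for;
for my $i (0 .. $#headers) {
    $seq_for{ $headers[$i] } = join '', @{ $seqs[$i] };
}

print "$_: $seq_for{$_}\n" for @headers;
```

Keeping the lines as an array ref means you can still get at individual lines later; join them only when you need the whole sequence as one string.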
    HTH

    _________
    broquaint

Re: multi-line match
by BrowserUk (Patriarch) on Aug 05, 2003 at 16:14 UTC

    If there is any chance of there being more than one blank line between your records, you should set $/=''; to enable "paragraph mode" in preference to "\n\n".

    To quote from perlvar, under $INPUT_RECORD_SEPARATOR:

    Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.
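A small sketch of that difference, assuming an in-memory string stands in for the file. The records_with helper and the sample text are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read $text record by record using separator $sep.
sub records_with {
    my ($sep, $text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = $sep;
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Two records separated by three blank lines (made-up sample data).
my $text = "a\nb\n\n\n\nc\nd\n";

my @fixed = records_with("\n\n", $text);  # the leftover "\n\n" comes back as its own record
my @para  = records_with("",     $text);  # paragraph mode swallows the extra blank lines

printf "separator \"\\n\\n\": %d records\n", scalar @fixed;
printf "paragraph mode:    %d records\n", scalar @para;
```

With "\n\n" you get a spurious record that contains nothing but newlines, which is exactly the kind of thing that makes a header regex fail on some records.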

    If your files are not too big, you could also set local $/=undef; (or simply, local $/;) to read the whole file into a scalar and then use m//g in a while loop to process the records.

        $s = "abc\ncd\ne\n\npqr\nst\nf\n\n";
        while( $s =~ m[ (\w+) \n (\w+) \n (\w+) ]gx ) {
            print "$1:$2:$3\n";
        }

        abc:cd:e
        pqr:st:f
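The same slurp-and-match idea applied to the *header layout from the question might look like the sketch below. The sample text, the in-memory filehandle standing in for INPUT1, and the @records structure are all assumptions for illustration.

```perl
use strict;
use warnings;

# Made-up sample in the *header format from the question.
my $sample = "*header one\nabc\ndef\n\n*header two\nghi\n";

# Slurp the whole "file" into one scalar (an in-memory handle stands in for INPUT1).
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $text = do { local $/; <$fh> };
close $fh;

my @records;
# Each match is one header line plus everything up to the next header or end of input.
while ( $text =~ m{ ^ \* ([^\n]*) \n (.*?) (?= ^ \* | \z ) }gxms ) {
    my ($head, $seq) = ($1, $2);
    $seq =~ s/\s+//g;                 # drop newlines so the sequence is one string
    push @records, [ $head, $seq ];
}

print "$_->[0]: $_->[1]\n" for @records;
```

Because the lookahead stops at the next line that starts with *, this version does not care whether there are zero, one or several blank lines between records.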

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: multi-line match
by JamesNC (Chaplain) on Aug 05, 2003 at 17:02 UTC
    It looks like you want the stuff between the headers as messages, so this simple deal will build an array indexed by header. If you want to preserve the newlines, just remove the chomp; otherwise process the array as you like:
        use strict;
        my ($line, $i);
        my @messages;
        while(<DATA>){
            my $hdr = 0;
            if(/header/){       # a header line starts the next message
                $hdr = 1;
                $i++;
            }
            unless($hdr){       # any other line is appended to the current message
                chomp;
                $messages[$i-1] .= "$_ ";
            }
        }
        print "$_ \n" for @messages;
        __DATA__
        header
        This is message one line 1.
        This is message one line 2.
        header
        Yet some more example text. Blah, blah..
        Line 2 more blah, blah..

    JamesNC
Re: multi-line match
by pdotcdot (Acolyte) on Aug 06, 2003 at 14:22 UTC
    Thanks guys, I'll try out these different options! The errors in the code were cut-and-paste errors and debugging stuff I was trying out. Cheers, PC
Re: multi-line match
by pdotcdot (Acolyte) on Aug 07, 2003 at 10:01 UTC
    Hi, I have received a new file which contains no blank lines between entries, e.g.
        *header
        asdasdddas
        asdsadds
        *header
        sdsdasd
        asdsdds
    I have tried modding the suggestions above but they all come back with garbage, and I have Super Searched as well to no avail. Sorry to ask for help again so soon, but time is pressing! Thanks in advance, PC

      The easiest way would be to use two passes. The first adds a blank line before each header:

      perl -ple"$_ = qq[\n] . $_ if /^\*header/" infile >modified

      NB! Different quotes on *nix: wrap the one-liner in single quotes there instead of double quotes.

      The second pass is just whichever of the earlier answers you like best.
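If a temporary file is unwanted, both passes can be sketched inside one script. This is only an illustration under assumptions: the input is a made-up string, an in-memory filehandle replaces the real file, and paragraph mode is picked as the second pass.

```perl
use strict;
use warnings;

# Made-up input with no blank lines between entries.
my $raw = "*header\nasdasdddas\nasdsadds\n*header\nsdsdasd\nasdsdds\n";

# Pass 1: insert a blank line in front of every header line.
(my $spaced = $raw) =~ s/^(?=\*header)/\n/mg;

# Pass 2: read the result in paragraph mode, one record per header.
my @seqs;
{
    open my $fh, '<', \$spaced or die "cannot open in-memory file: $!";
    local $/ = "";
    while ( my $record = <$fh> ) {
        my ($head, @lines) = split /\n/, $record;   # first line is the header
        push @seqs, join '', @lines;                # the rest is the sequence
    }
    close $fh;
}

print "$_\n" for @seqs;
```

Paragraph mode skips the blank line that pass 1 puts in front of the very first header, so no empty leading record appears.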



        Thanks very much BrowserUk and all monks. I'm still sorting out different multiline queries but I am determined to get there! Thanks again!