parsing multiple lines

sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a text file that needs to be modified. It has blocks of numbered descriptions. I would like to separate the information in different lines and include 'unknown' when a part of the description is missing. Here is a list of steps I would like the script to do:

1. The first word following the number is the locus tag. I want to remove the number and have the locus tag on a separate line (this part works in my code).
2. The rest of the first line is the name. I want this to be in another line with the 'Arabidopsis thaliana' removed (this part works in my code).
3. If the next line starts with KEGG I want all lines until 'function evidence' to be written as a block without "KEGG pathway (this part works in my code). if this part is absent I want to write 'unknown' (need help).
4. Then there should be descriptions for 'function evidence', 'process evidence', and 'component evidence'(in the same order). If any one of these are missing, 'unknown' needs to be written in place of that (need help). Then these subtitles 'function evidence', 'process evidence', and 'component evidence' need to be omitted (this part works in my code).
5. All categories(6): locus tag, name, kegg, function evidence, process evidence, and component evidence needs to be seperated by a new line before writing the next category.

I have managed to get some of the steps, but haven't been able to figure how to write 'unknown' when some of the categories are missing. I'm new to perl and it would be great if someone can help me with this half written script I have. I would like a solution where I don't have to use any modules or subroutines. Any tips/ideas greatly appreciated. Thanks.
here is the input file, desired output and the script I've written:

sample input file:

1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ] 
Function Evidence
transcription factor activity


Process Evidence
regulation of transcription


Component Evidence
nucleus IEA 

2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis th
+aliana ] 
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose
metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pat
+hway:
Pentose phosphate pathway00030

Function Evidence
fructose-bisphosphate aldolase activity


Process Evidence
pentose-phosphate shunt TAS 
response to oxidative stress


Component Evidence
chloroplast
mitochondrion
plastoglobule

3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Ara
+bidopsis
Function Evidence
protein binding RCA 
zinc ion binding RCA 

4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ] 
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
[download]

desired output

AT2G01060 

myb family transcription factor

Unknown

transcription factor activity

regulation of transcription

nucleus IEA 

AT2G01140 

fructose-bisphosphate aldolase, putative 
 
Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis /
+ Gluconeogenesis00010 Pentose phosphate pathway00030

fructose-bisphosphate aldolase activity

pentose-phosphate shunt TAS 
response to oxidative stress

chloroplast
mitochondrion
plastoglobule

AT2G01275 

zinc finger (C3HC4-type RING finger) family protein 

Unknown

protein binding RCA 
zinc ion binding RCA 

Unknown

Unknown

AT2G01320 

ABC transporter family protein 

Unknown

Unknown

ATPase activity, coupled to transmembrane movement of substances ISS 

Unknown
[download]

#!/usr/bin/perl   
# to modify cleangene
# an infile (to be read in)and an outfile (to write to)
# and both should be open
$infile = "clean.txt"; #output of batch entrez gene cleaned
open (IN, $infile) or die "can't open file: $!";

$outfile = "genetable.txt";
open (OUT, ">$outfile") or die "can't open file: $!";
# reading one line at a time using the FILE handle

while (<IN>) {
    if ($_ =~ /^(\d+:\s\w.+)/) { # disecting the first line into locus
+ tag and name
    $name = $_;
    $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[
+)//;
    $name =~ s/^\d+:\s//;
    @array = split(/\s+/, $name);
    $locus_tag = @array[0];
    print OUT "$locus_tag\n \n";
    $name =~ s/^(AT\w+)|(\w+)//;
    $name =~ s/^\s//;
    print OUT "$name\n";
}
    next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sE
+vidence)|^(\d+:\s\w.+)/;
    
    if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ #removing "KEGG p
+athway" from the kegg description
        $kegg = $_,
        $kegg =~ s/^(KEGG\spathway:\s)//;
        $kegg =~ s/KEGG/ KEGG/g;
        $kegg =~ s/(KEGG\spathway:\s)//g;
        print OUT "$kegg";
}    
    else {print OUT $_;}
    
    
    
}
[download]

Comment on parsing multiple lines Select or Download Code

Replies are listed 'Best First'.
Re: parsing multiple lines by toolic (Bishop) on May 21, 2008 at 18:43 UTC
One way to do this is to create a flag or a state variable to remember previous lines. For example, for the KEGG sections, I created a `$kegg_found` variable. Here is the revised version of the code: Read more... (2 kB) This should solve your problem 3. You should be able to use this same principle to solve problem 4. I also made a number of other changes to the code. I added the strictures: `use strict; use warnings;` [download] This produced a warning (which I fixed): `Scalar value @array[0] better written as $array[0]` [download] I declared all variables with my to eliminate the strict compiler errors. I eliminated the capturing parentheses from all your regexes since you were not using the captured values. I eliminated the unnecessary `$_ =~` I closed the open file handles. Hope this helps.	[reply] [d/l] [select]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 22, 2008 at 00:49 UTC
Thanks so much. Yes, it did help a lot. I can get it to work now.	[reply]
Re: parsing multiple lines by apl (Monsignor) on May 21, 2008 at 18:31 UTC
Whenever you start a new block of text, set all flags (defined below) to zero. 3. If the next line starts with KEGG I want all lines until 'function evidence' to be written as a block without "KEGG pathway (this part works in my code). if this part is absent I want to write 'unknown' (need help). Define "part". The line starting "KEGG"? A line starting with "KEGG" that doesn't have a pathway? If you find the desired "part", set $kegg to 1. When you hit "function evidence", set $kegg to zero. Whenever you read a line, if $kegg is set to 1, print it out. 4. Then there should be descriptions for 'function evidence', 'process evidence', and 'component evidence'(in the same order). If any one of these are missing, 'unknown' needs to be written in place of that (need help). Set a flag when each of the key phrases is encountered. When a line is a key phrase, and the flag is set, display "unknown". Otherwise, display the contents of the line. Clear the flag.	[reply]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 22, 2008 at 00:46 UTC
Thanks for the ideas.	[reply]
Re: parsing multiple lines by psini (Deacon) on May 21, 2008 at 18:58 UTC
An alternative approach could be to see the parsing of your file as a state machine: initialize $status=0 and then, for each line from the file, check for the "type" of the line (title line,name, kegg, function evidence, process evidence, component evidence, other). switch on the line type and do as follow: title line: if $status>0 call the output function (see later); then extract locus tag and name to two vars, initialize $kegg, $function, $process, $component as "unknown", set $status=1. kegg: strip the "KEGG pathway:" portion of the line and put the remainder in $kegg, set $status=2. function evidence: set $function='' and $status=3. process evidence: set $process='' and $status=4. component evidence: set $component='' and $status=5. other: depending on the value of $status (between 2 and 5) add the line to the corresponding var. If status<2 do nothing. At end of file, if $status>0 call again the output function (this is needed to output the last block). The output function should take the values stored in the 6 vars and print them to the output file Rule One: Do not act incautiously when confronting a little bald wrinkly smiling man.	[reply]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 22, 2008 at 00:51 UTC
I could definitely use this idea for some of my other scripts too. Thanks a lot.	[reply]
Re: parsing multiple lines by pc88mxer (Vicar) on May 21, 2008 at 19:13 UTC
My preferred approach is basically the same as what has been said above, but I like to use a hash ref to keep track of the parsed information. Another way to look at this problem is that the file is a sequence of records, and after you parse a complete record you want to 'process' it in some fashion. The typical way to do this is: `my $r = {}; # hash to hold the parsed record while (<IN>) { if (/^(\d+).../) { # found beginning of new record if ($r->{id}) { process($r); } $r = {}; # begin new record $r->{id} = $1; # populate parsed info from this line } elsif (/KEGG.../) { $r->{kegg} = ...; } elsif ... } } if ($r->{id}) { process($r) }; sub process { my $r = shift; ... }` [download] By changing the `process` subroutine you can re-use this code to perform different kinds of analyses on the file.	[reply] [d/l] [select]
Re^2: parsing multiple lines by GrandFather (Saint) on May 21, 2008 at 21:19 UTC
Why a hash ref rather than a hash? Surely it is simpler and clearer to write: `my %record; while (<IN>) { if (/^(\d+).../) { # found beginning of new record process (\%record) if $record{id}; %record = (); # Flush old record $record{id} = $1; # populate parsed info from this line ...` [download] Perl is environmentally friendly - it saves trees	[reply] [d/l]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 22, 2008 at 00:53 UTC
Thanks a lot. I'm learning from all alternative ideas.	[reply]
Re: parsing multiple lines by mwah (Hermit) on May 21, 2008 at 19:24 UTC
if it's an option to slurp the file before processing, like: `... my $fn = 'data.dat'; my $stuff; open my$fh, '<', 'data.dat' or die "$fn - $!"; { local $/; $stuff = <$fh> } close $fh; ...` [download] then the program could be simplified like: ... while( $stuff =~ /^ (\d+): \s+ # the number xx: => $1 (\w+) \s+ # the locus => $2 ( (?:.(?!^\d+:))+ ) # the remaining record => $3 /msgx ) { my ($locus, $name, $record, $kegg, $func, $proc, $comp) = ($2, '', +$3, ('unknown')x4); $name = $1 if $record =~ /([^\n\[]+)\s*/; $kegg = $1 if $record =~ /KEGG \s+ pathway: \s+ (.+?)\s+Function \s ++ Evidence/sx; $func = $1 if $record =~ /Function \s+ Evidence \s+ (.+?) (?:\n\n\| +\z)/sx; $proc = $1 if $record =~ /Process \s+ Evidence \s+ (.+?) (?:\n\n\| +\z)/sx; $comp = $1 if $record =~ /Component \s+ Evidence \s+ (.+?) (?:\n\n\| +\z)/sx; print join "\n\n", $locus, $name, $kegg, $func, $proc, $comp; } ... [download] Regards mwa	[reply] [d/l] [select]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 22, 2008 at 01:07 UTC
Thanks. I appreciate all the suggestions and help.	[reply]
Re^2: parsing multiple lines by sm2004 (Acolyte) on May 28, 2008 at 22:52 UTC
Thanks a lot. I'm using your script now as it's easy to modify some of my other text files to work with this script/idea. I didn't understand some of the syntax before (I'm new to perl) but I get it now. Thanks again.	[reply]
Re: parsing multiple lines by hesco (Deacon) on May 21, 2008 at 19:19 UTC
The advice above about setting flags to track what sort of input you are currently examining and using those flags to drive what and how to write your output is on the mark and exactly how I approach this issue when writing www scraping tools. The only thing I want to add is a comment to your remark, that you "would like a solution where I don't have to use any modules or subroutines." Modules and subroutines are how we re-use code. Code re-use is essential to writing testable code, minimizing bugs and copy-n-paste errors, expediting development and refactoring, avoid re-inventing wheels and leverage the work and wisdom of others to serve our own applications. Don't fear code re-use. It will make you a stronger developer. -- Hugh if( $lal && $lol ) { $life++; }	[reply]
Re: parsing multiple lines by Anonymous Monk on May 22, 2008 at 21:46 UTC
Wouldn't it be easier to use their API? http://www.genome.jp/kegg/soap/ You just need to use SOAP lite to handle their KEGG.wsdl file and retrieve the exact information you want directly from their database. It's well documented and has several examples on how to use. Writing the parser is of course a nice exercise but you should be aware that they change their output from time to time and therefore your parser might eventualy break.	[reply]