sm2004 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a text file that needs to be modified. It has blocks of numbered descriptions. I would like to separate the information in different lines and include 'unknown' when a part of the description is missing. Here is a list of steps I would like the script to do:
1. The first word following the number is the locus tag. I want to remove the number and have the locus tag on a separate line (this part works in my code).
2. The rest of the first line is the name. I want this to be in another line with the 'Arabidopsis thaliana' removed (this part works in my code).
3. If the next line starts with KEGG I want all lines until 'function evidence' to be written as a block without "KEGG pathway (this part works in my code). if this part is absent I want to write 'unknown' (need help).
4. Then there should be descriptions for 'function evidence', 'process evidence', and 'component evidence'(in the same order). If any one of these are missing, 'unknown' needs to be written in place of that (need help). Then these subtitles 'function evidence', 'process evidence', and 'component evidence' need to be omitted (this part works in my code).
5. All categories(6): locus tag, name, kegg, function evidence, process evidence, and component evidence needs to be seperated by a new line before writing the next category.
I have managed to get some of the steps, but haven't been able to figure how to write 'unknown' when some of the categories are missing. I'm new to perl and it would be great if someone can help me with this half written script I have. I would like a solution where I don't have to use any modules or subroutines. Any tips/ideas greatly appreciated. Thanks.
here is the input file, desired output and the script I've written:
sample input file: 1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ] Function Evidence transcription factor activity Process Evidence regulation of transcription Component Evidence nucleus IEA 2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis th +aliana ] KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pat +hway: Pentose phosphate pathway00030 Function Evidence fructose-bisphosphate aldolase activity Process Evidence pentose-phosphate shunt TAS response to oxidative stress Component Evidence chloroplast mitochondrion plastoglobule 3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Ara +bidopsis Function Evidence protein binding RCA zinc ion binding RCA 4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ] Process Evidence ATPase activity, coupled to transmembrane movement of substances ISS

desired output
AT2G01060 myb family transcription factor Unknown transcription factor activity regulation of transcription nucleus IEA AT2G01140 fructose-bisphosphate aldolase, putative Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / + Gluconeogenesis00010 Pentose phosphate pathway00030 fructose-bisphosphate aldolase activity pentose-phosphate shunt TAS response to oxidative stress chloroplast mitochondrion plastoglobule AT2G01275 zinc finger (C3HC4-type RING finger) family protein Unknown protein binding RCA zinc ion binding RCA Unknown Unknown AT2G01320 ABC transporter family protein Unknown Unknown ATPase activity, coupled to transmembrane movement of substances ISS Unknown
#!/usr/bin/perl # to modify cleangene # an infile (to be read in)and an outfile (to write to) # and both should be open $infile = "clean.txt"; #output of batch entrez gene cleaned open (IN, $infile) or die "can't open file: $!"; $outfile = "genetable.txt"; open (OUT, ">$outfile") or die "can't open file: $!"; # reading one line at a time using the FILE handle while (<IN>) { if ($_ =~ /^(\d+:\s\w.+)/) { # disecting the first line into locus + tag and name $name = $_; $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[ +)//; $name =~ s/^\d+:\s//; @array = split(/\s+/, $name); $locus_tag = @array[0]; print OUT "$locus_tag\n \n"; $name =~ s/^(AT\w+)|(\w+)//; $name =~ s/^\s//; print OUT "$name\n"; } next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sE +vidence)|^(\d+:\s\w.+)/; if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ #removing "KEGG p +athway" from the kegg description $kegg = $_, $kegg =~ s/^(KEGG\spathway:\s)//; $kegg =~ s/KEGG/ KEGG/g; $kegg =~ s/(KEGG\spathway:\s)//g; print OUT "$kegg"; } else {print OUT $_;} }

Replies are listed 'Best First'.
Re: parsing multiple lines
by toolic (Bishop) on May 21, 2008 at 18:43 UTC
    One way to do this is to create a flag or a state variable to remember previous lines. For example, for the KEGG sections, I created a $kegg_found variable. Here is the revised version of the code:

    This should solve your problem 3. You should be able to use this same principle to solve problem 4.

    I also made a number of other changes to the code. I added the strictures:

    use strict; use warnings;

    This produced a warning (which I fixed):

    Scalar value @array[0] better written as $array[0]

    I declared all variables with my to eliminate the strict compiler errors.

    I eliminated the capturing parentheses from all your regexes since you were not using the captured values.

    I eliminated the unnecessary $_ =~

    I closed the open file handles.

    Hope this helps.

      Thanks so much. Yes, it did help a lot. I can get it to work now.
Re: parsing multiple lines
by apl (Monsignor) on May 21, 2008 at 18:31 UTC
    Whenever you start a new block of text, set all flags (defined below) to zero.
    3. If the next line starts with KEGG I want all lines until 'function evidence' to be written as a block without "KEGG pathway (this part works in my code). if this part is absent I want to write 'unknown' (need help).
    Define "part". The line starting "KEGG"? A line starting with "KEGG" that doesn't have a pathway?

    If you find the desired "part", set $kegg to 1.

    When you hit "function evidence", set $kegg to zero.

    Whenever you read a line, if $kegg is set to 1, print it out.

    4. Then there should be descriptions for 'function evidence', 'process evidence', and 'component evidence'(in the same order). If any one of these are missing, 'unknown' needs to be written in place of that (need help).
    Set a flag when each of the key phrases is encountered. When a line is a key phrase, and the flag is set, display "unknown". Otherwise, display the contents of the line. Clear the flag.
      Thanks for the ideas.
Re: parsing multiple lines
by psini (Deacon) on May 21, 2008 at 18:58 UTC

    An alternative approach could be to see the parsing of your file as a state machine: initialize $status=0 and then, for each line from the file, check for the "type" of the line (title line,name, kegg, function evidence, process evidence, component evidence, other).

    switch on the line type and do as follow:
    title line: if $status>0 call the output function (see later); then extract locus tag and name to two vars, initialize $kegg, $function, $process, $component as "unknown", set $status=1.
    kegg: strip the "KEGG pathway:" portion of the line and put the remainder in $kegg, set $status=2.
    function evidence: set $function='' and $status=3.
    process evidence: set $process='' and $status=4.
    component evidence: set $component='' and $status=5.
    other: depending on the value of $status (between 2 and 5) add the line to the corresponding var. If status<2 do nothing.

    At end of file, if $status>0 call again the output function (this is needed to output the last block).

    The output function should take the values stored in the 6 vars and print them to the output file

    Rule One: Do not act incautiously when confronting a little bald wrinkly smiling man.

      I could definitely use this idea for some of my other scripts too. Thanks a lot.
Re: parsing multiple lines
by pc88mxer (Vicar) on May 21, 2008 at 19:13 UTC
    My preferred approach is basically the same as what has been said above, but I like to use a hash ref to keep track of the parsed information. Another way to look at this problem is that the file is a sequence of records, and after you parse a complete record you want to 'process' it in some fashion. The typical way to do this is:
    my $r = {}; # hash to hold the parsed record while (<IN>) { if (/^(\d+).../) { # found beginning of new record if ($r->{id}) { process($r); } $r = {}; # begin new record $r->{id} = $1; # populate parsed info from this line } elsif (/KEGG.../) { $r->{kegg} = ...; } elsif ... } } if ($r->{id}) { process($r) }; sub process { my $r = shift; ... }
    By changing the process subroutine you can re-use this code to perform different kinds of analyses on the file.

      Why a hash ref rather than a hash? Surely it is simpler and clearer to write:

      my %record; while (<IN>) { if (/^(\d+).../) { # found beginning of new record process (\%record) if $record{id}; %record = (); # Flush old record $record{id} = $1; # populate parsed info from this line ...

      Perl is environmentally friendly - it saves trees
      Thanks a lot. I'm learning from all alternative ideas.
Re: parsing multiple lines
by mwah (Hermit) on May 21, 2008 at 19:24 UTC

    if it's an option to slurp the file before processing, like:

    ... my $fn = 'data.dat'; my $stuff; open my$fh, '<', 'data.dat' or die "$fn - $!"; { local $/; $stuff = <$fh> } close $fh; ...

    then the program could be simplified like:

    ... while( $stuff =~ /^ (\d+): \s+ # the number xx: => $1 (\w+) \s+ # the locus => $2 ( (?:.(?!^\d+:))+ ) # the remaining record => $3 /msgx ) { my ($locus, $name, $record, $kegg, $func, $proc, $comp) = ($2, '', +$3, ('unknown')x4); $name = $1 if $record =~ /([^\n\[]+)\s*/; $kegg = $1 if $record =~ /KEGG \s+ pathway: \s+ (.+?)\s+Function \s ++ Evidence/sx; $func = $1 if $record =~ /Function \s+ Evidence \s+ (.+?) (?:\n\n| +\z)/sx; $proc = $1 if $record =~ /Process \s+ Evidence \s+ (.+?) (?:\n\n| +\z)/sx; $comp = $1 if $record =~ /Component \s+ Evidence \s+ (.+?) (?:\n\n| +\z)/sx; print join "\n\n", $locus, $name, $kegg, $func, $proc, $comp; } ...

    Regards

    mwa

      Thanks. I appreciate all the suggestions and help.
      Thanks a lot. I'm using your script now as it's easy to modify some of my other text files to work with this script/idea. I didn't understand some of the syntax before (I'm new to perl) but I get it now. Thanks again.
Re: parsing multiple lines
by hesco (Deacon) on May 21, 2008 at 19:19 UTC
    The advice above about setting flags to track what sort of input you are currently examining and using those flags to drive what and how to write your output is on the mark and exactly how I approach this issue when writing www scraping tools. The only thing I want to add is a comment to your remark, that you "would like a solution where I don't have to use any modules or subroutines."

    Modules and subroutines are how we re-use code. Code re-use is essential to writing testable code, minimizing bugs and copy-n-paste errors, expediting development and refactoring, avoid re-inventing wheels and leverage the work and wisdom of others to serve our own applications. Don't fear code re-use. It will make you a stronger developer.

    -- Hugh

    if( $lal && $lol ) { $life++; }
Re: parsing multiple lines
by Anonymous Monk on May 22, 2008 at 21:46 UTC
    Wouldn't it be easier to use their API?
    http://www.genome.jp/kegg/soap/

    You just need to use SOAP lite to handle their KEGG.wsdl file and retrieve the exact information you want directly from their database.

    It's well documented and has several examples on how to use.

    Writing the parser is of course a nice exercise but you should be aware that they change their output from time to time and therefore your parser might eventualy break.