
Hi,
I have a text file that needs to be modified. It has blocks of numbered descriptions. I would like to put each piece of information on its own line and write 'unknown' when a part of the description is missing. Here is a list of the steps I would like the script to perform:

1. The first word following the number is the locus tag. I want to remove the number and put the locus tag on a separate line (this part works in my code).
2. The rest of the first line is the name. I want this on another line, with '[ Arabidopsis thaliana ]' removed (this part works in my code).
3. If the next line starts with KEGG, I want all lines up to 'Function Evidence' written as one block without the "KEGG pathway:" labels (this part works in my code). If this part is absent, I want to write 'unknown' (need help).
4. Then there should be descriptions for 'Function Evidence', 'Process Evidence', and 'Component Evidence' (in that order). If any of these is missing, 'unknown' needs to be written in its place (need help). The subtitles 'Function Evidence', 'Process Evidence', and 'Component Evidence' themselves need to be omitted (this part works in my code).
5. All six categories (locus tag, name, KEGG, function evidence, process evidence, and component evidence) need to be separated from one another by a blank line.
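
Steps 1 and 2 could be sketched with a single match on the header line. This is only an illustration: the `@headers` and `@parsed` arrays are hypothetical names, the two lines are copied from the sample input, and it assumes the gene name itself never contains a `[`.

```perl
use strict;
use warnings;

# Two header lines from the sample input; the second one is cut short.
my @headers = (
    '1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ] ',
    '3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis',
);

my @parsed;   # hypothetical holder for [locus tag, name] pairs
for my $line (@headers) {
    # number and colon, then the locus tag (first word), then the rest of the line
    if ($line =~ /^\d+:\s+(\S+)\s+(.*)$/) {
        my ($locus_tag, $name) = ($1, $2);
        # drop everything from the "[" on, so a truncated species tag goes too
        $name =~ s/\s*\[.*$//;
        push @parsed, [$locus_tag, $name];
        print "$locus_tag\n\n$name\n\n";
    }
}
```

Cutting at the first `[` avoids having to list every way the species tag may be truncated.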

I have managed to get some of the steps working, but I haven't been able to figure out how to write 'unknown' when some of the categories are missing. I'm new to Perl, and it would be great if someone could help me with this half-written script. I would like a solution that doesn't use any modules or subroutines. Any tips or ideas are greatly appreciated. Thanks.
Here are the input file, the desired output, and the script I've written:

sample input file:

1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ] 
Function Evidence
transcription factor activity


Process Evidence
regulation of transcription


Component Evidence
nucleus IEA 

2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ] 
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose
metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway:
Pentose phosphate pathway00030

Function Evidence
fructose-bisphosphate aldolase activity


Process Evidence
pentose-phosphate shunt TAS 
response to oxidative stress


Component Evidence
chloroplast
mitochondrion
plastoglobule

3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA 
zinc ion binding RCA 

4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ] 
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS

desired output

AT2G01060 

myb family transcription factor

Unknown

transcription factor activity

regulation of transcription

nucleus IEA 

AT2G01140 

fructose-bisphosphate aldolase, putative 
 
Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / Gluconeogenesis00010 Pentose phosphate pathway00030

fructose-bisphosphate aldolase activity

pentose-phosphate shunt TAS 
response to oxidative stress

chloroplast
mitochondrion
plastoglobule

AT2G01275 

zinc finger (C3HC4-type RING finger) family protein 

Unknown

protein binding RCA 
zinc ion binding RCA 

Unknown

Unknown

AT2G01320 

ABC transporter family protein 

Unknown

Unknown

ATPase activity, coupled to transmembrane movement of substances ISS 

Unknown

#!/usr/bin/perl   
# to modify cleangene
# an infile (to be read in)and an outfile (to write to)
# and both should be open
$infile = "clean.txt"; #output of batch entrez gene cleaned
open (IN, $infile) or die "can't open file: $!";

$outfile = "genetable.txt";
open (OUT, ">$outfile") or die "can't open file: $!";
# reading one line at a time using the FILE handle

while (<IN>) {
    if ($_ =~ /^(\d+:\s\w.+)/) { # dissecting the first line into locus tag and name
    $name = $_;
    $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[)//;
    $name =~ s/^\d+:\s//;
    @array = split(/\s+/, $name);
    $locus_tag = $array[0];
    print OUT "$locus_tag\n \n";
    $name =~ s/^(AT\w+)|(\w+)//;
    $name =~ s/^\s//;
    print OUT "$name\n";
}
    next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sEvidence)|^(\d+:\s\w.+)/;
    
    if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ # removing "KEGG pathway" from the kegg description
        $kegg = $_;
        $kegg =~ s/^(KEGG\spathway:\s)//;
        $kegg =~ s/KEGG/ KEGG/g;
        $kegg =~ s/(KEGG\spathway:\s)//g;
        print OUT "$kegg";
}    
    else {print OUT $_;}
    
    
    
}
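
One possible way to get the missing 'Unknown' entries, sketched here under some assumptions rather than offered as the definitive fix: instead of printing each line as it is read (as the script above does), collect the lines of each record in a hash keyed by category, and write the record out only when the next numbered header or the end of the file is reached. Any category that was never filled is then printed as 'Unknown'. The names `%section`, `@order`, `$current` and `$output` are made up for this illustration, the data is a shortened copy of the sample input read from a string so the example runs on its own, and trailing blanks are trimmed. It uses no modules and no subroutines.

```perl
use strict;
use warnings;

# Records 2 to 4 of the sample input (forum line wraps joined, blank lines trimmed).
my $sample = <<'END';
2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose
metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway:
Pentose phosphate pathway00030

Function Evidence
fructose-bisphosphate aldolase activity

Process Evidence
pentose-phosphate shunt TAS
response to oxidative stress

Component Evidence
chloroplast
mitochondrion
plastoglobule

3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA
zinc ion binding RCA

4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
END

# Reading from the string keeps the sketch self-contained.
open my $in, '<', \$sample or die "can't open sample: $!";

my @order   = ('kegg', 'function', 'process', 'component');
my %section;          # text collected so far for the current record
my $current = '';     # category the current line belongs to
my ($locus_tag, $name);
my $output  = '';     # the real script would print to OUT instead

while (1) {
    my $line = <$in>;

    # A new numbered header (or end of input) closes the previous record.
    if (!defined $line or $line =~ /^\d+:\s/) {
        if (defined $locus_tag) {
            if (defined $section{kegg}) {
                $section{kegg} =~ s/\s*KEGG pathway:\s*/ /g;   # drop the labels
                $section{kegg} =~ s/^\s+|\s+$//g;              # trim both ends
                $section{kegg} .= "\n";
            }
            $output .= "$locus_tag\n\n$name\n\n";
            for my $key (@order) {
                # a category that never appeared is written as 'Unknown'
                $output .= (defined $section{$key} ? $section{$key} : "Unknown\n") . "\n";
            }
        }
        last unless defined $line;

        # Start the next record: the locus tag is the first word after the number.
        ($locus_tag, $name) = $line =~ /^\d+:\s+(\S+)\s+(.*)/;
        $name =~ s/\s*\[.*$//;         # remove the (possibly truncated) species tag
        %section = ();
        $current = '';
        next;
    }

    $line =~ s/\s+$//;                 # strips the newline and trailing blanks
    next if $line eq '';

    if ($line =~ /^(Function|Process|Component)\sEvidence/) {
        $current = lc $1;              # 'function', 'process' or 'component'
        next;
    }
    $current = 'kegg' if $line =~ /^KEGG pathway:/;
    next if $current eq '';

    # KEGG lines are joined with spaces, evidence lines stay one per line.
    $section{$current} .= $current eq 'kegg' ? "$line " : "$line\n";
}

print $output;
```

To try this on the real data, the in-memory open would be swapped for the `clean.txt` input file and `$output` printed to the `genetable.txt` handle. Remembering which category the current line belongs to is what lets a wrapped KEGG line that does not start with "KEGG pathway:" still end up in the right block.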

In reply to parsing multiple lines by sm2004
