Hi,
I have a text file that needs to be modified. It has blocks of numbered descriptions. I would like to put each piece of information on its own line and write 'unknown' when a part of the description is missing. Here is a list of steps I would like the script to perform:
1. The first word following the number is the locus tag. I want to remove the number and have the locus tag on a separate line (this part works in my code).
2. The rest of the first line is the name. I want this on another line with '[ Arabidopsis thaliana ]' removed (this part works in my code).
3. If the next line starts with KEGG, I want all lines until 'Function Evidence' to be written as a block without "KEGG pathway:" (this part works in my code). If this part is absent, I want to write 'unknown' (need help).
4. Then there should be descriptions for 'Function Evidence', 'Process Evidence', and 'Component Evidence' (in that order). If any one of these is missing, 'unknown' needs to be written in its place (need help). The subtitles 'Function Evidence', 'Process Evidence', and 'Component Evidence' themselves need to be omitted (this part works in my code).
5. All six categories (locus tag, name, KEGG, function evidence, process evidence, and component evidence) need to be separated by a new line before the next category is written.
I have managed to get some of the steps working, but haven't been able to figure out how to write 'unknown' when some of the categories are missing. I'm new to Perl and it would be great if someone could help me with this half-written script. I would like a solution where I don't have to use any modules or subroutines. Any tips or ideas are greatly appreciated. Thanks.
Here are the input file, the desired output, and the script I've written:
Sample input file:
1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
Function Evidence
transcription factor activity
Process Evidence
regulation of transcription
Component Evidence
nucleus IEA
2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway: Pentose phosphate pathway00030
Function Evidence
fructose-bisphosphate aldolase activity
Process Evidence
pentose-phosphate shunt TAS response to oxidative stress
Component Evidence
chloroplast mitochondrion plastoglobule
3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA zinc ion binding RCA
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS

Desired output:
AT2G01060
myb family transcription factor
Unknown
transcription factor activity
regulation of transcription
nucleus IEA
AT2G01140
fructose-bisphosphate aldolase, putative
Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / Gluconeogenesis00010 Pentose phosphate pathway00030
fructose-bisphosphate aldolase activity
pentose-phosphate shunt TAS response to oxidative stress
chloroplast mitochondrion plastoglobule
AT2G01275
zinc finger (C3HC4-type RING finger) family protein
Unknown
protein binding RCA zinc ion binding RCA
Unknown
Unknown
AT2G01320
ABC transporter family protein
Unknown
Unknown
ATPase activity, coupled to transmembrane movement of substances ISS
Unknown
#!/usr/bin/perl
# to modify cleangene
# an infile (to be read in) and an outfile (to write to)
# and both should be open
$infile = "clean.txt"; #output of batch entrez gene cleaned
open (IN, $infile) or die "can't open file: $!";
$outfile = "genetable.txt";
open (OUT, ">$outfile") or die "can't open file: $!";
# reading one line at a time using the FILE handle
while (<IN>) {
    if ($_ =~ /^(\d+:\s\w.+)/) { # dissecting the first line into locus tag and name
        $name = $_;
        $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[)//;
        $name =~ s/^\d+:\s//;
        @array = split(/\s+/, $name);
        $locus_tag = @array[0];
        print OUT "$locus_tag\n \n";
        $name =~ s/^(AT\w+)|(\w+)//;
        $name =~ s/^\s//;
        print OUT "$name\n";
    }
    next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sEvidence)|^(\d+:\s\w.+)/;
    if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ #removing "KEGG pathway" from the kegg description
        $kegg = $_, $kegg =~ s/^(KEGG\spathway:\s)//;
        $kegg =~ s/KEGG/ KEGG/g;
        $kegg =~ s/(KEGG\spathway:\s)//g;
        print OUT "$kegg";
    }
    else {print OUT $_;}
}
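One possible way to handle the missing categories, offered only as a sketch: instead of printing each line as it is read, collect the fields of the current record in a hash and print the whole record when the next numbered line appears, substituting 'Unknown' for any key that was never filled. The names `@lines`, `%rec`, `$section`, `@order`, and `@out` are my own and not from the script above, and the sample data is inlined so the sketch runs on its own (in the real script the lines would come from `while (<IN>)`). It assumes each header sits on its own line and that multi-line descriptions can be joined with a single space. `use strict` and `use warnings` are pragmas rather than modules that need installing, and no subroutines are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data inlined for illustration; replace with lines read from the file.
my @lines = split /\n/, <<'END';
1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
Function Evidence
transcription factor activity
Process Evidence
regulation of transcription
Component Evidence
nucleus IEA
2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway: Pentose phosphate pathway00030
Function Evidence
fructose-bisphosphate aldolase activity
Process Evidence
pentose-phosphate shunt TAS response to oxidative stress
Component Evidence
chloroplast mitochondrion plastoglobule
3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA zinc ion binding RCA
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
END

my @order   = ('kegg', 'function', 'process', 'component'); # output order
my %rec;            # fields collected for the current record
my $section = '';   # category the current line belongs to
my @out;            # finished output lines

# The extra '0: END' line is a sentinel so the last real record gets flushed.
for my $line (@lines, '0: END') {
    if ($line =~ /^\d+:\s+(\S+)\s*(.*)/) {
        my ($tag, $name) = ($1, $2);
        if (%rec) {   # flush the previous record, filling the gaps
            push @out, $rec{tag}, $rec{name};
            for my $key (@order) {
                push @out, defined $rec{$key} ? $rec{$key} : 'Unknown';
            }
        }
        $name =~ s/\s*\[\s*Arabidopsis.*$//;   # drop the (possibly truncated) species
        %rec     = (tag => $tag, name => $name);
        $section = '';
    }
    elsif ($line =~ /^(Function|Process|Component)\sEvidence/) {
        $section = lc $1;   # header line: remember the category, print nothing
    }
    else {
        $section = 'kegg' if $line =~ /^KEGG\spathway:/;
        next if $section eq '' || $line !~ /\S/;
        (my $text = $line) =~ s/\s*KEGG\spathway:\s*/ /g;
        $text =~ s/^\s+//;
        # Join continuation lines of the same category with a space.
        $rec{$section} = defined $rec{$section} ? "$rec{$section} $text" : $text;
    }
}

print "$_\n" for @out;
```

With the sample above, each record comes out as six lines in a fixed order, so record 3 ends with two 'Unknown' lines and record 4 has 'Unknown' in the KEGG, function, and component positions. To write to a file instead, print to the `OUT` handle as in the original script.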

In reply to parsing multiple lines by sm2004
