sm2004 has asked for the wisdom of the Perl Monks concerning the following question:
sample input file:

    1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
    Function Evidence
    transcription factor activity
    Process Evidence
    regulation of transcription
    Component Evidence
    nucleus IEA
    2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
    KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway: Pentose phosphate pathway00030
    Function Evidence
    fructose-bisphosphate aldolase activity
    Process Evidence
    pentose-phosphate shunt TAS response to oxidative stress
    Component Evidence
    chloroplast mitochondrion plastoglobule
    3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
    Function Evidence
    protein binding RCA zinc ion binding RCA
    4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
    Process Evidence
    ATPase activity, coupled to transmembrane movement of substances ISS
desired output:

    AT2G01060 myb family transcription factor Unknown transcription factor activity regulation of transcription nucleus IEA
    AT2G01140 fructose-bisphosphate aldolase, putative Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / Gluconeogenesis00010 Pentose phosphate pathway00030 fructose-bisphosphate aldolase activity pentose-phosphate shunt TAS response to oxidative stress chloroplast mitochondrion plastoglobule
    AT2G01275 zinc finger (C3HC4-type RING finger) family protein Unknown protein binding RCA zinc ion binding RCA Unknown Unknown
    AT2G01320 ABC transporter family protein Unknown Unknown ATPase activity, coupled to transmembrane movement of substances ISS Unknown
    #!/usr/bin/perl
    # to modify cleangene
    # an infile (to be read in) and an outfile (to write to)
    # and both should be open

    $infile = "clean.txt"; # output of batch entrez gene cleaned
    open (IN, $infile) or die "can't open file: $!";

    $outfile = "genetable.txt";
    open (OUT, ">$outfile") or die "can't open file: $!";

    # reading one line at a time using the FILE handle
    while (<IN>) {
        if ($_ =~ /^(\d+:\s\w.+)/) { # dissecting the first line into locus tag and name
            $name = $_;
            $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[)//;
            $name =~ s/^\d+:\s//;
            @array = split(/\s+/, $name);
            $locus_tag = @array[0];
            print OUT "$locus_tag\n \n";
            $name =~ s/^(AT\w+)|(\w+)//;
            $name =~ s/^\s//;
            print OUT "$name\n";
        }
        next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sEvidence)|^(\d+:\s\w.+)/;
        if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ # removing "KEGG pathway" from the kegg description
            $kegg = $_, $kegg =~ s/^(KEGG\spathway:\s)//;
            $kegg =~ s/KEGG/ KEGG/g;
            $kegg =~ s/(KEGG\spathway:\s)//g;
            print OUT "$kegg";
        }
        else {print OUT $_;}
    }
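The script above prints each line as it is read, so it has no way to emit "Unknown" for a section that never appears in a record. One possible approach, offered only as a rough sketch, is to collect each record into a hash first and print it once the whole input has been read. The sub name `parse_records`, the hash keys, and the tab-separated one-row-per-gene layout are all assumptions made for illustration; the column order is inferred from the desired output in the question. It also assumes every record begins with a `N: LOCUS name` line and that each `Function Evidence`, `Process Evidence`, and `Component Evidence` header sits on its own line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn the cleaned Entrez Gene
# text into one hash per record, filling absent sections with "Unknown".
sub parse_records {
    my ($text) = @_;
    my (@records, $rec, $section);

    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
        next unless length $line;     # skip blank lines

        if ($line =~ /^\d+:\s+(\S+)\s*(.*)$/) {
            # Header line, e.g. "1: AT2G01060 name [ Arabidopsis thaliana ]"
            my ($locus, $name) = ($1, $2);
            $name =~ s/\s*\[.*$//;    # drop the (possibly truncated) organism tag
            $rec = { locus => $locus, name => $name };
            push @records, $rec;
            $section = undef;
        }
        elsif (!$rec) {
            next;                     # ignore anything before the first record
        }
        elsif ($line =~ /^(Function|Process|Component)\s+Evidence$/) {
            $section = lc $1;         # following lines belong to this section
        }
        elsif ($line =~ /^KEGG pathway:/) {
            # Several pathways may be run together on a single line.
            my @pathways = grep { length } split /\s*KEGG pathway:\s*/, $line;
            push @{ $rec->{kegg} }, @pathways;
        }
        elsif ($section) {
            push @{ $rec->{$section} }, $line;
        }
    }

    # Join collected values; sections that never appeared become "Unknown".
    for my $r (@records) {
        for my $field (qw(kegg function process component)) {
            $r->{$field} = $r->{$field} ? join(' ', @{ $r->{$field} }) : 'Unknown';
        }
    }
    return @records;
}

# Two records taken from the sample input in the question.
my $sample = <<'END';
1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
Function Evidence
transcription factor activity
Process Evidence
regulation of transcription
Component Evidence
nucleus IEA
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
END

for my $r (parse_records($sample)) {
    print join("\t", @{$r}{qw(locus name kegg function process component)}), "\n";
}
```

For AT2G01320, which has only a Process section, the KEGG, Function, and Component columns come out as "Unknown". To read from `clean.txt` instead of the inline sample, the file contents could be slurped into a string and passed to the same sub.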
Re: parsing multiple lines
by toolic (Bishop) on May 21, 2008 at 18:43 UTC
by sm2004 (Acolyte) on May 22, 2008 at 00:49 UTC

Re: parsing multiple lines
by apl (Monsignor) on May 21, 2008 at 18:31 UTC
by sm2004 (Acolyte) on May 22, 2008 at 00:46 UTC

Re: parsing multiple lines
by psini (Deacon) on May 21, 2008 at 18:58 UTC
by sm2004 (Acolyte) on May 22, 2008 at 00:51 UTC

Re: parsing multiple lines
by pc88mxer (Vicar) on May 21, 2008 at 19:13 UTC
by GrandFather (Saint) on May 21, 2008 at 21:19 UTC
by sm2004 (Acolyte) on May 22, 2008 at 00:53 UTC

Re: parsing multiple lines
by mwah (Hermit) on May 21, 2008 at 19:24 UTC
by sm2004 (Acolyte) on May 22, 2008 at 01:07 UTC
by sm2004 (Acolyte) on May 28, 2008 at 22:52 UTC

Re: parsing multiple lines
by hesco (Deacon) on May 21, 2008 at 19:19 UTC

Re: parsing multiple lines
by Anonymous Monk on May 22, 2008 at 21:46 UTC