sample input file:

1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
Function Evidence
transcription factor activity
Process Evidence
regulation of transcription
Component Evidence
nucleus IEA
2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway: Pentose phosphate pathway00030
Function Evidence
fructose-bisphosphate aldolase activity
Process Evidence
pentose-phosphate shunt TAS response to oxidative stress
Component Evidence
chloroplast mitochondrion plastoglobule
3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA zinc ion binding RCA
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS

####

AT2G01060 myb family transcription factor Unknown transcription factor activity regulation of transcription nucleus IEA
AT2G01140 fructose-bisphosphate aldolase, putative Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / Gluconeogenesis00010 Pentose phosphate pathway00030 fructose-bisphosphate aldolase activity pentose-phosphate shunt TAS response to oxidative stress chloroplast mitochondrion plastoglobule
AT2G01275 zinc finger (C3HC4-type RING finger) family protein Unknown protein binding RCA zinc ion binding RCA Unknown Unknown
AT2G01320 ABC transporter family protein Unknown Unknown ATPase activity, coupled to transmembrane movement of substances ISS Unknown

####

#!/usr/bin/perl
use strict;
use warnings;

# to modify cleangene
# needs an infile (to be read) and an outfile (to write to),
# and both should be open
my $infile = "clean.txt";    # output of batch entrez gene, cleaned
open(IN, '<', $infile) or die "can't open file: $!";
my $outfile = "genetable.txt";
open(OUT, '>', $outfile) or die "can't open file: $!";

# reading one line at a time using the IN file handle
while (<IN>) {
    if ($_ =~ /^(\d+:\s\w.+)/) {
        # dissecting the first line into locus tag and name
        my $name = $_;
        # drop the organism tag, including the truncated "[ Arabidopsis" form
        $name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[)//;
        # drop the leading record number ("1: ", "2: ", ...)
        $name =~ s/^\d+:\s//;
        my @array = split(/\s+/, $name);
        my $locus_tag = $array[0];    # scalar element, not the slice @array[0]
        print OUT "$locus_tag\n \n";
        # what remains after the locus tag is the gene name
        $name =~ s/^(AT\w+)|(\w+)//;
        $name =~ s/^\s//;
        print OUT "$name\n";
    }
    # skip the section headers and the record line handled above
    next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sEvidence)|^(\d+:\s\w.+)/;
    if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/) {
        # removing "KEGG pathway" from the kegg description
        my $kegg = $_;
        $kegg =~ s/^(KEGG\spathway:\s)//;
        $kegg =~ s/KEGG/ KEGG/g;
        $kegg =~ s/(KEGG\spathway:\s)//g;
        print OUT "$kegg";
    }
    else {
        print OUT $_;
    }
}

close(IN);
close(OUT);