sample input file:
1: AT2G01060 myb family transcription factor [ Arabidopsis thaliana ]
Function Evidence
transcription factor activity
Process Evidence
regulation of transcription
Component Evidence
nucleus IEA
2: AT2G01140 fructose-bisphosphate aldolase, putative [ Arabidopsis thaliana ]
KEGG pathway: Carbon fixation00710KEGG pathway: Fructose and mannose
metabolism00051KEGG pathway: Glycolysis / Gluconeogenesis00010KEGG pathway:
Pentose phosphate pathway00030
Function Evidence
fructose-bisphosphate aldolase activity
Process Evidence
pentose-phosphate shunt TAS
response to oxidative stress
Component Evidence
chloroplast
mitochondrion
plastoglobule
3: AT2G01275 zinc finger (C3HC4-type RING finger) family protein [ Arabidopsis
Function Evidence
protein binding RCA
zinc ion binding RCA
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
####
desired output:
AT2G01060
myb family transcription factor
Unknown
transcription factor activity
regulation of transcription
nucleus IEA
AT2G01140
fructose-bisphosphate aldolase, putative
Carbon fixation00710 Fructose and mannose metabolism00051 Glycolysis / Gluconeogenesis00010 Pentose phosphate pathway00030
fructose-bisphosphate aldolase activity
pentose-phosphate shunt TAS
response to oxidative stress
chloroplast
mitochondrion
plastoglobule
AT2G01275
zinc finger (C3HC4-type RING finger) family protein
Unknown
protein binding RCA
zinc ion binding RCA
Unknown
Unknown
AT2G01320
ABC transporter family protein
Unknown
Unknown
ATPase activity, coupled to transmembrane movement of substances ISS
Unknown
####
#!/usr/bin/perl
# Reformat the cleaned batch Entrez Gene output into a gene table.
# Needs an infile (to be read) and an outfile (to be written to);
# both must be open before the loop below runs.
$infile = "clean.txt"; #output of batch entrez gene cleaned
open (IN, '<', $infile) or die "can't open $infile: $!";
$outfile = "genetable.txt";
open (OUT, '>', $outfile) or die "can't open $outfile: $!";
# reading one line at a time using the IN filehandle
while (<IN>) {
if ($_ =~ /^(\d+:\s\w.+)/) { # dissecting the first line into locus tag and name
$name = $_;
$name =~ s/(\[\sArabidopsis\sthaliana\s\])|(\[\sArabidopsis\s)|(\[)//;
$name =~ s/^\d+:\s//;
@array = split(/\s+/, $name);
$locus_tag = $array[0];
print OUT "$locus_tag\n \n";
$name =~ s/^(AT\w+)|(\w+)//;
$name =~ s/^\s//;
print OUT "$name\n";
}
next if /(^Function\sEvidence)|^(Process\sEvidence)|^(Component\sEvidence)|^(\d+:\s\w.+)/;
if ($_ =~ /(KEGG\spathway:)|(\w+\d\d\d\d\d\s)/){ #removing "KEGG pathway" from the kegg description
$kegg = $_;
$kegg =~ s/^(KEGG\spathway:\s)//;
$kegg =~ s/KEGG/ KEGG/g;
$kegg =~ s/(KEGG\spathway:\s)//g;
print OUT "$kegg";
}
else {print OUT $_;}
}
close (IN);
close (OUT) or die "can't close $outfile: $!";
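
The script above copies annotation lines through as they come, so it cannot emit the "Unknown" placeholders shown in the second listing. A minimal sketch of one way to get them follows. It assumes every record starts with an "N: LOCUS name" line, that any lines before the first "... Evidence" header belong to the KEGG description, and that everything from the first "[" onward in the name can be dropped. The sub name format_records and the hash keys are made up for illustration; they are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: group each record's lines by section, then print
# the sections in a fixed order, writing "Unknown" for any empty one.
sub format_records {
    my ($text) = @_;
    my (@records, $rec, $section);
    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;              # trim both ends
        next if $line eq '';
        if ($line =~ /^\d+:\s+(\S+)\s+(.*)$/) {
            my ($locus, $name) = ($1, $2);
            $name =~ s/\s*\[.*$//;            # assumption: cut at the first "[", even if truncated
            $rec = { locus => $locus, name => $name,
                     KEGG => [], Function => [], Process => [], Component => [] };
            push @records, $rec;
            $section = 'KEGG';                # assumption: lines before any header are KEGG text
        }
        elsif ($line =~ /^(Function|Process|Component)\s+Evidence$/) {
            $section = $1;
        }
        elsif ($rec) {
            push @{ $rec->{$section} }, $line;
        }
    }
    my $out = '';
    for my $r (@records) {
        # Re-join wrapped KEGG lines, then strip the repeated labels.
        my $kegg = join ' ', @{ $r->{KEGG} };
        $kegg =~ s/\s*KEGG pathway:\s*/ /g;
        $kegg =~ s/^\s+|\s+$//g;
        $out .= "$r->{locus}\n$r->{name}\n";
        $out .= ($kegg eq '' ? 'Unknown' : $kegg) . "\n";
        for my $s (qw(Function Process Component)) {
            my @items = @{ $r->{$s} };
            @items = ('Unknown') unless @items;
            $out .= "$_\n" for @items;
        }
    }
    return $out;
}

# Usage on one record taken from the sample input.
my $sample = <<'END';
4: AT2G01320 ABC transporter family protein [ Arabidopsis thaliana ]
Process Evidence
ATPase activity, coupled to transmembrane movement of substances ISS
END
print format_records($sample);
```

Collecting a whole record before printing is what makes the placeholders possible: a missing section is only known to be missing once the next record (or the end of the file) is reached. To run it on clean.txt, the file contents would be read into one string and passed to the sub.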