I apologise for rewriting a major portion of your code. I usually try to change as little as possible, but somehow that didn't work here :)
The main change in approach is to use the readline function to read data from the text file as needed. It seemed that whenever you found a 'gene' line, you would need to read the next line for 'CDS' or 'exon' data. You could do this with a flag (as you initially suggested), but why not simply do what you need to do and read the next line immediately?
The other thing I changed is that the data is now stored in a hash reference (instead of a hash). This is not a requirement per se, but Data::Dump prints hashrefs in an easier-to-understand way than hashes.
Also, I replaced the data entries in @BMB with $ar_record. It is easier to store lots of records as references than as arrays.
Lastly, I removed a lot of variable declarations from the start of the program and put them where they are needed/filled. There is no need to fear a negative performance impact from declaring variables within a loop; Perl handles this just fine. It also helps you keep the data in scope, so your main program won't know about the 'temporary' variables that were used inside the loop. (I'm not sure I'm explaining this well...)
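To make the scoping point a bit more concrete, here is a minimal sketch. The sample lines and the names @lines, @types and @fields are made up for illustration and are not taken from your file:

```perl
use strict;
use warnings;

# Hypothetical sample lines, not taken from Genomteil.gff
my @lines = ("a\tb\tgene", "c\td\tCDS");
my @types;

for my $line (@lines) {
    # @fields is created fresh on every pass and goes away at the closing brace
    my @fields = split /\t/, $line;
    push @types, $fields[2];
}

# Only @types survives the loop; @fields is not visible here.
# Uncommenting the next line makes 'use strict' abort compilation:
# print "@fields\n";
print "@types\n";    # prints "gene CDS"
```

The loop-local @fields cannot leak stale data from one line into the next, which is the main practical benefit of declaring it inside the loop.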
#!/usr/bin/perl
# Task: Extract GeneID-Number and gene information
use strict;
use warnings;
use Data::Dump;

my $in;
my $hr_data;

# 1) open the .gff Inputfile and while reading line by line split $data at each tab and put them in the @array
open ($in, '<', "Genomteil.gff") or die $!;
while (my $line1 = readline ($in)) {
    chomp ($line1);    # Removes trailing \n
    my @a_line1 = split ("\t", $line1);
    next if @a_line1 < 9;    # Skip comment lines and lines without all nine GFF columns
    if ($a_line1[2] eq 'gene') {
        if ($a_line1[8] =~ /.*;db_xref=GeneID:(\d+)/) {
            my $GeneID = $1;
            # We found a GeneID. Create a record (array-reference) to store with the data from this line
            my $ar_record = [$a_line1[3], $a_line1[4], $a_line1[6]];    # the array will be used as values for my hash later

            # Also, read the next line from file, which we expect to contain CDS or exon
            my $line2 = readline ($in);
            last unless defined $line2;    # The file ended right after a 'gene' line
            chomp ($line2);
            my @a_line2 = split ("\t", $line2);
            if (defined $a_line2[2] and $a_line2[2] =~ /CDS|exon/) {
                # Alternatively: ($a_line2[2] eq 'CDS' or $a_line2[2] eq 'exon')
                push (@{$ar_record}, $a_line2[2]);
                $hr_data->{$GeneID} = $ar_record;
            } else {
                print ("Error: next line does not contain CDS or exon [$.]\n");
                next;
            }
        } else {
            print ("Error: 'gene' textblock found, but no GeneID present at line [$.]\n");
            next;
        }
    } ## end if ($a_line1[2] eq 'gene')
} ## end while (my $line1 = readline...)
close $in;

Data::Dump::dd($hr_data);
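If you want to work with the stored records rather than just dump them, something along these lines should do. This is only a sketch: the two records below are invented, and I am assuming the usual GFF column order (start, end, strand) for the three fields the script copies, plus the feature type pushed on at the end.

```perl
use strict;
use warnings;

# Hypothetical data in the same shape the script builds:
# GeneID => [start, end, strand, feature type]
my $hr_data = {
    1001 => [ 100, 900,  '+', 'CDS'  ],
    1002 => [ 950, 1800, '-', 'exon' ],
};

my @report;
for my $gene_id (sort { $a <=> $b } keys %{$hr_data}) {
    my $ar_record = $hr_data->{$gene_id};                 # one array reference per gene
    my ($start, $end, $strand, $type) = @{$ar_record};    # dereference into named fields
    push @report, "$gene_id: $start-$end ($strand) $type";
}
print "$_\n" for @report;
```

Sorting the keys numerically gives a stable order; a plain keys %{$hr_data} would return the GeneIDs in no particular order.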
Hi,
thanks for your efforts!
I see your point, and I liked the changes to the variable declarations and the use of references. Unfortunately I can't use the readline function, because I was given the specific task of solving the problem with flags (and only flags). Do you have any idea how to do that, even though your script makes much more sense than using flags?
In that case, set the $flag-variable to 0 when you expect to read a 'gene'-line, and set it to 1 when you expect to read a 'CDS' or 'exon'-line.
You will need to keep @BMB declared before the while-loop: the data you put into it while $flag was 0 must still be there on the next pass through the loop, when $flag is 1.
Note that the error-handling is not present in this version, but you can add that yourself.
This results in the following:
#!/usr/bin/perl
# Task: Extract GeneID-Number and gene information
use strict;
use warnings;

my $in;
my $data;
my @array;
my $GeneID;
my @BMB;
my $flag = 0;
my %hash;

# 1) open the .gff Inputfile and while reading line by line split $data at each tab and put them in the @array
open $in, '<', "Genomteil.gff" or die $!;
while ($data = <$in>) {
    @array = split (/\t/, $data);
    if ($flag == 0) {
        if ($array[2] =~ /gene/) {    # if you find the word 'gene', a textblock follows which contains some information I want to extract and put in an array
            $flag = 1;    # Set the flag. We will be expecting a 'CDS' or 'exon'-line next
            @BMB = ($array[3], $array[4], $array[6]);    # the array will be used as values for my hash later
        } ## end if ($array[2] =~ /gene/)
        if ($array[8] =~ /.*;db_xref=GeneID:(\d+)\n/) {    # if you find the word 'GeneID', extract the following number; it is used as the hash key below
            $GeneID = $1;
        } ## end if ($array[8] =~ /.*;db_xref=GeneID:(\d+)\n/)
    } elsif ($flag == 1) {
        if ($array[2] =~ /CDS/) {
            push (@BMB, $array[2]);    # put more data in my array
        } elsif ($array[2] =~ /exon/) {
            push (@BMB, $array[2]);
        }
        @{$hash{$GeneID}} = @BMB;
        $flag = 0;    # Reset the flag. We will be expecting a 'gene'-line next
    }
} ## end while ($data = <$in>)
close $in;

my $BMB;
while (($GeneID, $BMB) = each %hash) {
    # $BMB is an array reference here, so dereference it ($BMB[0] would read the unrelated array @BMB)
    print "$GeneID => @{$BMB}\n";
}
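As for the error handling left out above, it could be added to the flag approach roughly like this. Treat it as a sketch under assumptions: the sample lines are invented, @errors and $line_no are names I made up, and I use a looser GeneID pattern (no trailing \n) so that the attribute does not have to be the last thing on the line.

```perl
use strict;
use warnings;

# Hypothetical sample input (tab-separated, GFF-like), not from Genomteil.gff
my @lines = (
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1;db_xref=GeneID:1001",
    "chr1\tsrc\tCDS\t100\t900\t.\t+\t0\tID=c1",
    "chr1\tsrc\tgene\t950\t1800\t.\t-\t.\tID=g2;db_xref=GeneID:1002",
    "chr1\tsrc\tregion\t1\t2000\t.\t+\t.\tID=r1",
);

my %hash;
my @errors;
my ($flag, $GeneID, @BMB) = (0);    # $flag starts at 0, the rest start empty
my $line_no = 0;

for my $data (@lines) {
    $line_no++;
    my @array = split /\t/, $data;
    next if @array < 9;    # skip comments and malformed lines

    if ($flag == 0) {
        next unless $array[2] eq 'gene';
        if ($array[8] =~ /db_xref=GeneID:(\d+)/) {
            $GeneID = $1;
            @BMB    = @array[3, 4, 6];
            $flag   = 1;    # expect 'CDS' or 'exon' next
        } else {
            push @errors, "line $line_no: 'gene' without GeneID";
        }
    } else {
        if ($array[2] eq 'CDS' or $array[2] eq 'exon') {
            $hash{$GeneID} = [ @BMB, $array[2] ];
        } else {
            push @errors, "line $line_no: expected CDS or exon after gene $GeneID";
        }
        $flag = 0;    # back to expecting a 'gene' line either way
    }
}

print "$_ => @{ $hash{$_} }\n" for sort keys %hash;
print "Error: $_\n" for @errors;
```

Here the flag is only set once both the 'gene' type and the GeneID have been found, so a record is never stored under a stale or missing key.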