Okay, here you see the elements of the array (@array) for one textblock (after using split). Now it should be easier to understand the rest of the script.
[0] NC_014171.1
[1] RefSeq
[2] gene
[3] 14311
[4] 14425
[5] .
[6] +
[7] .
[8] ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898
Next line of the file, but still the same 'textblock' of data:
[0] NC_014171.1
[1] RefSeq
[2] exon (or CDS)
[3] 14311
[4] 14425
[5] .
[6] +
[7] .
[8] ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;gbkey=rRNA;locus_tag=BMB171_C5091;product=5S ribosomal RNA;db_xref=GeneID:9190898;exon_number=1
What I want is to create a hash:
I want to extract the number behind the word 'GeneID' (line one of textblock, at the end of element 8) and this number should be the key of my hash.
The values for my hash should first be stored in an array. I need the following information as values for my array:
line one of textblock: $array[3] (a number), $array[4] (also a number), $array[6] (+ or -)
line two of textblock: $array[2], which can be the word 'CDS' or 'exon'
So far the script is working. Problems arise when the next text block is processed.
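For reference, the target structure described above (GeneID as key, an array of start, end, strand and feature type as value) could be sketched like this. It is only an illustration: parse_block is a made-up helper name, and the two sample lines are rebuilt from the dump above with the second attribute string shortened.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): turns the two lines
# of one text block into (GeneID, array-reference with the four values).
sub parse_block {
    my ($gene_line, $feature_line) = @_;
    my @gene    = split /\t/, $gene_line;
    my @feature = split /\t/, $feature_line;

    # The GeneID is the number after 'GeneID:' in element 8 of the gene line.
    my ($gene_id) = $gene[8] =~ /GeneID:(\d+)/
        or return;    # no GeneID found: return an empty list

    # Elements 3, 4 and 6 of the gene line plus element 2 of the second line.
    return ($gene_id, [ @gene[3, 4, 6], $feature[2] ]);
}

# Sample lines rebuilt from the element dump above (assumed tab-separated).
my $gene_line = join "\t", 'NC_014171.1', 'RefSeq', 'gene', 14311, 14425,
    '.', '+', '.',
    'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898';
my $feature_line = join "\t", 'NC_014171.1', 'RefSeq', 'exon', 14311, 14425,
    '.', '+', '.',
    'ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1';

my %genes;
my ($id, $record) = parse_block($gene_line, $feature_line);
$genes{$id} = $record if defined $id;

# Print every key with its stored values.
print "$_ => @{ $genes{$_} }\n" for sort keys %genes;
```

With the sample block this should store the values 14311, 14425, + and exon under the key 9190898.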
I apologise for rewriting a major portion of your code. I usually try to change as little as possible, but somehow that didn't work here :)
The major approach change is to use the readline-function to read data from the textfile as needed. It seemed like whenever you found a 'gene'-line, you would need to read the next line for 'CDS' or 'exon'-data. You could do this with a flag (as you initially suggested), but why not do simply what you need to do...read the next line immediately?
The other thing I changed is that the data is now stored in a hash-reference (instead of a hash). This is not a requirement per se, but Data::Dump prints hashrefs in an easier-to-understand way than hashes.
Also, I replaced the data-entries @BMB with $ar_record. It is easier to store lots of records as references instead of arrays.
Lastly, I removed a lot of variable declarations from the start of the program and put them where they are needed/filled. There is no need to fear a negative performance impact from declaring variables within a loop; Perl handles this just fine. It also keeps the data in scope, so your main program never sees the 'temporary' variables that are only used inside the loop. (I'm not sure I'm explaining this well...)
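Maybe a tiny example explains the scoping point better. This is just a sketch with made-up sample lines, not part of the actual script:

```perl
use strict;
use warnings;

# Made-up sample data, two tab-separated lines.
my @lines = ("a\t1", "b\t2");
my %seen;

for my $line (@lines) {
    # Declared inside the loop: every iteration gets a fresh @fields,
    # and the variable no longer exists once the loop body ends.
    my @fields = split /\t/, $line;
    $seen{ $fields[0] } = $fields[1];
}

# Only %seen is visible here. Referring to @fields at this point would be
# a compile-time error under 'use strict'.
print join(', ', map { "$_=$seen{$_}" } sort keys %seen), "\n";
```

Leftover values from a previous iteration cannot leak into the next one this way, which is exactly the kind of problem that tends to show up when "the next text block is processed".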
#!/usr/bin/perl
# Task: Extract GeneID-Number and gene information
use strict;
use warnings;
use Data::Dump;

my $in;
my $hr_data;

# 1) Open the .gff input file and, while reading line by line, split the
#    data at each tab and put the fields in an array
open ($in, '<', "Genomteil.gff") or die $!;
while (my $line1 = readline ($in)) {
    chomp ($line1);    # Removes trailing \n
    my @a_line1 = split ("\t", $line1);
    next if @a_line1 < 9;    # Skip comment/header lines without all nine columns
    if ($a_line1[2] eq 'gene') {
        if ($a_line1[8] =~ /.*;db_xref=GeneID:(\d+)/) {
            my $GeneID = $1;    # 'my' is required under 'use strict'
            # We found a GeneID. Create a record (array-reference) to store
            # the data from this line; it is used as the hash value later
            my $ar_record = [$a_line1[3], $a_line1[4], $a_line1[6]];
            # Also, read the next line from file, which we expect to contain
            # CDS or exon
            my $line2 = readline ($in);
            last unless defined $line2;    # File ended right after a 'gene' line
            chomp ($line2);
            my @a_line2 = split ("\t", $line2);
            # Alternatively: ($a_line2[2] eq 'CDS' or $a_line2[2] eq 'exon')
            if ($a_line2[2] =~ /CDS|exon/) {
                push (@{$ar_record}, $a_line2[2]);
                $hr_data->{$GeneID} = $ar_record;
            } else {
                print ("Error: next line does not contain CDS or exon [$.]\n");
                next;
            }
        } else {
            print ("Error: 'gene' textblock found, but no GeneID present at line [$.]\n");
            next;
        }
    } ## end if ($a_line1[2] eq 'gene')
} ## end while (my $line1 = readline...)
close $in;
Data::Dump::dd($hr_data);
Hi,
thanks for your efforts!
I see your point, and I like the changes to the variable declarations and the use of references. Unfortunately I can't use the readline function, because I was given the specific task of solving the problem with flags (and only flags). Do you have any idea how to do that, even though your script makes much more sense than using flags?
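For illustration, a flag-only variant might look roughly like the sketch below. It is untested against the real file: the two sample lines stand in for Genomteil.gff (attributes shortened), and the names $gene_seen, $gene_id and @record are made up for this example. The important part is that the flag and the stored values are reset for every block, so nothing carries over into the next text block.

```perl
use strict;
use warnings;

# Sample data standing in for the .gff file (assumed tab-separated).
my @lines = (
    join("\t", 'NC_014171.1', 'RefSeq', 'gene', 14311, 14425, '.', '+', '.',
         'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898'),
    join("\t", 'NC_014171.1', 'RefSeq', 'exon', 14311, 14425, '.', '+', '.',
         'ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1'),
);

my %data;
my $gene_seen = 0;    # the flag: true while a gene line waits for its CDS/exon line
my ($gene_id, @record);

for my $line (@lines) {
    my @f = split /\t/, $line;
    next if @f < 9;    # skip lines that do not have all nine columns

    if ($f[2] eq 'gene') {
        if ($f[8] =~ /GeneID:(\d+)/) {
            # Remember the values of the gene line and raise the flag.
            $gene_id   = $1;
            @record    = @f[3, 4, 6];
            $gene_seen = 1;
        }
        else {
            $gene_seen = 0;    # gene line without GeneID: ignore this block
        }
    }
    elsif ($gene_seen and $f[2] =~ /^(?:CDS|exon)$/) {
        # Second line of the block: complete the record and lower the flag.
        $data{$gene_id} = [ @record, $f[2] ];
        $gene_seen = 0;
    }
}

print "$_ => @{ $data{$_} }\n" for sort keys %data;
```

Unlike the readline version, the state ($gene_seen, $gene_id, @record) has to live outside the loop here, because it must survive from the 'gene' line to the following line. To read from the real file, the for loop would become a while loop over the filehandle with a chomp on each line.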