Re^2: problems with flip flop

Okay, here you see the elements of the array (@array) for one textblock (after using split). Now it should be easier to understand the rest of the script.

[0] NC_014171.1    
[1] RefSeq    
[2] gene    
[3] 14311    
[4] 14425    
[5]   .    
[6]  +    
[7]  .    
[8] ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898
new line but same 'textblock' of data

[0] NC_014171.1    
[1] RefSeq    
[2] exon (or CDS)    
[3] 14311    
[4] 14425    
[5]  .    
[6]  +    
[7]  .    
[8] ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;gbkey=rRNA;locus_tag=BMB171_C5091;product=5S ribosomal RNA;db_xref=GeneID:9190898;exon_number=1

What I want is to build a hash. The key should be the number after the word 'GeneID' (line one of the textblock, at the end of element 8). The values should first be collected in an array, which needs the following information:

line one of the textblock: $array[3] (a number), $array[4] (again a number), $array[6] (which is + or -)
line two of the textblock: $array[2] (which can be the word 'CDS' or 'exon')

So far the script is working. Problems arise when the next textblock is processed.
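To make the target structure concrete, here is a minimal sketch built from the two sample lines above. It is not the script from the thread: the helper name `extract_gene_id` and the variable names are assumptions chosen for illustration, and the fields are typed in directly instead of being read from the .gff file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number after 'db_xref=GeneID:' or undef.
sub extract_gene_id {
    my ($attributes) = @_;
    return $attributes =~ /\bdb_xref=GeneID:(\d+)/ ? $1 : undef;
}

# The two lines of one textblock, already split into their nine fields.
my @gene_line = ('NC_014171.1', 'RefSeq', 'gene', 14311, 14425, '.', '+', '.',
    'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898');
my @exon_line = ('NC_014171.1', 'RefSeq', 'exon', 14311, 14425, '.', '+', '.',
    'ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;'
    . 'gbkey=rRNA;locus_tag=BMB171_C5091;product=5S ribosomal RNA;'
    . 'db_xref=GeneID:9190898;exon_number=1');

my %genes;
my $gene_id = extract_gene_id($gene_line[8]);

# Key: GeneID. Value: [start, end, strand] from line one plus the type from line two.
$genes{$gene_id} = [ @gene_line[3, 4, 6], $exon_line[2] ] if defined $gene_id;

print "$gene_id => @{ $genes{$gene_id} }\n";   # prints: 9190898 => 14311 14425 + exon
```

The array slice `@gene_line[3, 4, 6]` picks the three wanted fields in one step, and the anonymous array `[ ... ]` gives every GeneID its own independent record.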

Re^3: problems with flip flop by Neighbour (Friar) on Aug 17, 2011 at 12:00 UTC
I apologise for rewriting a major portion of your code. I usually try to change as little as possible, but somehow that didn't work here :)

The major change in approach is to use the `readline`-function to read data from the textfile as needed. It seemed that whenever you found a 'gene'-line, you would need to read the next line for 'CDS' or 'exon'-data. You could do this with a flag (as you initially suggested), but why not simply do what you need to do: read the next line immediately?

The other thing I changed is that the data is now stored in a hash-reference (instead of a hash). This is not a requirement per se, but Data::Dump prints hashrefs in a way that is easier to understand than hashes. Also, I replaced the data-entries @BMB with $ar_record. It is easier to store lots of records as references than as arrays.

Lastly, I removed a lot of variable declarations from the start of the program and put them where they are needed/filled. There is no need to fear a negative performance impact from initializing variables within a loop. Perl handles this just fine. This will also help you keep the data in scope (so your main program won't know the 'temporary' variables that were used inside the loop). (I'm not sure I'm explaining this well...)

    #!/usr/bin/perl
    # Task: Extract GeneID-Number and gene information
    use strict;
    use warnings;
    use Data::Dump;

    my $in;
    my $hr_data;

    # 1) Open the .gff input file and, while reading line by line, split each line at the tabs
    open ($in, '<', "Genomteil.gff") or die $!;
    while (my $line1 = readline ($in)) {
        chomp ($line1);    # Removes trailing \n
        my @a_line1 = split ("\t", $line1);
        next if @a_line1 < 9;    # Skip comment lines and anything without all nine columns
        if ($a_line1[2] eq 'gene') {
            if ($a_line1[8] =~ /.*;db_xref=GeneID:(\d+)/) {
                my $GeneID = $1;
                # We found a GeneID. Create a record (array-reference) to store the data from this line
                my $ar_record = [$a_line1[3], $a_line1[4], $a_line1[6]];    # the array will be used as values for my hash later
                # Also, read the next line from file, which we expect to contain CDS or exon
                my $line2 = readline ($in);
                last unless defined $line2;    # The file ended right after a 'gene'-line
                chomp ($line2);
                my @a_line2 = split ("\t", $line2);
                if (defined $a_line2[2] and $a_line2[2] =~ /^(?:CDS|exon)$/) {
                    # Alternatively: ($a_line2[2] eq 'CDS' or $a_line2[2] eq 'exon')
                    push (@{$ar_record}, $a_line2[2]);
                    $hr_data->{$GeneID} = $ar_record;
                } else {
                    print ("Error: next line does not contain CDS or exon [$.]\n");
                    next;
                }
            } else {
                print ("Error: 'gene' textblock found, but no GeneID present at line [$.]\n");
                next;
            }
        } ## end if ($a_line1[2] eq 'gene')
    } ## end while (my $line1 = readline...)
    close $in;
    Data::Dump::dd($hr_data);
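The point about declaring variables inside the loop also matters for correctness once records are stored as references. The following is a small sketch with made-up sample rows (the second row and all variable names are assumptions, not data from the thread). It contrasts one array declared outside the loop and stored by reference with a fresh lexical created on every iteration.

```perl
use strict;
use warnings;

# Made-up sample rows: [start, end, strand]
my @rows = ([14311, 14425, '+'], [20000, 20500, '-']);

# Variant A: one array declared outside the loop, stored by reference.
my (%shared, @buffer);
my $id = 0;
for my $row (@rows) {
    @buffer = @{$row};
    $shared{ ++$id } = \@buffer;    # every key points at the SAME array
}

# Variant B: a fresh lexical inside the loop.
my %fresh;
$id = 0;
for my $row (@rows) {
    my $record = [ @{$row} ];       # new anonymous array on each iteration
    $fresh{ ++$id } = $record;
}

print "shared: $shared{1}[0] $shared{2}[0]\n";   # prints: shared: 20000 20000
print "fresh:  $fresh{1}[0] $fresh{2}[0]\n";     # prints: fresh:  14311 20000
```

In variant A the first record is silently overwritten by the second, because both hash values refer to `@buffer`. Copying the array into the hash, or creating the record with `my` inside the loop as in variant B, avoids that.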
Re^4: problems with flip flop by bio25 (Initiate) on Aug 17, 2011 at 14:30 UTC
Hi, thanks for your efforts! I see your point, and I liked the changes to the variable declarations and the use of references. Unfortunately I can't use the `readline` function, because I was given the specific task of solving the problem with flags (only flags). Do you have any idea for that, even though your script makes much more sense than using flags?
Re^5: problems with flip flop by Neighbour (Friar) on Aug 17, 2011 at 14:58 UTC
In that case, set the $flag-variable to 0 when you expect to read a 'gene'-line, and set it to 1 when you expect to read a 'CDS' or 'exon'-line. You will need to keep the @BMB-variable declared before the while-loop, because the data you put in there (when $flag was 0) must still be there on the next pass through the loop, when $flag is 1. Note that error-handling is not present in this version, but you can add that yourself. This results in the following:

    #!/usr/bin/perl
    # Task: Extract GeneID-Number and gene information
    use strict;
    use warnings;

    my $in;
    my $data;
    my @array;
    my $GeneID;
    my @BMB;
    my $flag = 0;
    my %hash;

    # 1) Open the .gff input file and, while reading line by line, split $data at each tab and put the fields in @array
    open ($in, '<', "Genomteil.gff") or die $!;
    while ($data = <$in>) {
        chomp ($data);    # Removes trailing \n, so the GeneID pattern does not have to match it
        @array = split (/\t/, $data);
        next if @array < 9;    # Skip comment lines and anything without all nine columns
        if ($flag == 0) {
            if ($array[2] eq 'gene') {    # the word 'gene' starts a textblock containing the information to extract
                @BMB = ($array[3], $array[4], $array[6]);    # the array will be used as values for my hash later
                if ($array[8] =~ /;db_xref=GeneID:(\d+)/) {    # extract the number after 'GeneID'; it becomes the hash key
                    $GeneID = $1;
                    $flag = 1;    # Set the flag. We will be expecting a 'CDS' or 'exon'-line next
                }
            } ## end if ($array[2] eq 'gene')
        } elsif ($flag == 1) {
            if ($array[2] =~ /^(?:CDS|exon)$/) {
                push (@BMB, $array[2]);    # put more data in my array
            }
            @{$hash{$GeneID}} = @BMB;    # copy the array into the hash
            $flag = 0;    # Reset the flag. We will be expecting a 'gene'-line next
        }
    } ## end while ($data = <$in>)
    close $in;

    my $BMB;
    while (($GeneID, $BMB) = each %hash) {
        print "$GeneID => @{$BMB}\n";    # $BMB is an array-reference here, so dereference it
    }
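The error-handling left out above could be added along the following lines. This is only a sketch under stated assumptions: the sub name `parse_with_flag`, the choice to collect problems in a list instead of printing them, and feeding the lines from an in-memory array are all illustrative and not part of the original script. The attribute string of the sample exon line is shortened.

```perl
use strict;
use warnings;

# Hypothetical helper: parse GFF lines with a single flag as a two-state switch.
# Returns a hash-ref (GeneID => [start, end, strand, type]) and a list-ref of problems.
sub parse_with_flag {
    my @lines = @_;
    my (%genes, @problems, @record, $gene_id);
    my $flag    = 0;    # 0: expecting 'gene', 1: expecting 'CDS' or 'exon'
    my $line_no = 0;

    for my $line (@lines) {
        $line_no++;
        chomp $line;
        my @f = split /\t/, $line;
        next if @f < 9;    # skip comment or malformed lines

        if ($flag == 0) {
            next unless $f[2] eq 'gene';
            if ($f[8] =~ /\bdb_xref=GeneID:(\d+)/) {
                $gene_id = $1;
                @record  = @f[3, 4, 6];
                $flag    = 1;
            }
            else {
                push @problems, "line $line_no: 'gene' without GeneID";
            }
        }
        else {
            if ($f[2] eq 'CDS' or $f[2] eq 'exon') {
                $genes{$gene_id} = [ @record, $f[2] ];
            }
            else {
                # Note: an unexpected 'gene'-line here is reported and then dropped.
                push @problems, "line $line_no: expected CDS or exon, got '$f[2]'";
            }
            $flag = 0;    # either way, look for the next 'gene'-line
        }
    }
    return (\%genes, \@problems);
}

# Usage with the sample textblock from the question.
my @sample = (
    join("\t", 'NC_014171.1', 'RefSeq', 'gene', 14311, 14425, '.', '+', '.',
        'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898'),
    join("\t", 'NC_014171.1', 'RefSeq', 'exon', 14311, 14425, '.', '+', '.',
        'ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;gbkey=rRNA'),
);
my ($genes, $problems) = parse_with_flag(@sample);
print "$_ => @{ $genes->{$_} }\n" for sort keys %{$genes};   # prints: 9190898 => 14311 14425 + exon
print "Problem: $_\n" for @{$problems};
```

Storing `[ @record, $f[2] ]` creates a new array for every GeneID, so reusing `@record` across textblocks cannot overwrite earlier entries.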
Re^6: problems with flip flop by bio25 (Initiate) on Aug 17, 2011 at 16:06 UTC
Re^7: problems with flip flop by bio25 (Initiate) on Aug 18, 2011 at 07:21 UTC