bio25 has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear all! I have a problem with using the flip-flop. First I need to show you the text file (gff.file) which contains the data I want to work with:

NC_014171.1 RefSeq gene 14311 14425 . + . ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898
NC_014171.1 RefSeq exon 14311 14425 . + . ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;gbkey=rRNA;locus_tag=BMB171_C5091;product=5S ribosomal RNA;db_xref=GeneID:9190898;exon_number=1

NC_014171.1 RefSeq gene 14459 15460 . - . locus_tag=BMB171_C0007;db_xref=GeneID:9190899
NC_014171.1 RefSeq CDS 14462 15460 . - 0 locus_tag=BMB171_C0007;transl_table=11;product=hypothetical protein;protein_id=YP_003662545.1;db_xref=GI:296500845;db_xref=GeneID:9190899;exon_number=1

The empty line above doesn't exist in the text file. I just wanted to show you which lines belong together. Now, here is my script:

    # Task: Extract GeneID-Number and gene information
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $in;
    my $data;
    my @array;
    my $array;
    my $GeneID;
    my @BMB;
    my $BMB;
    my $flag = 0;
    my %hash;
    my $hash;

    # 1) open the .gff Inputfile and while reading line by line split $data at each tab and put them in the @array
    open $in, '<', "Genomteil.gff" or die $!;

    while ($data = <$in>) {
        @array = split(/\t/, $data);
        if ($array [2] =~/gene/){ #if you find the word 'gene' a textbloxk follows which contains some information I want to extract and put in an array)
            $flag = 1;
            @BMB = ($array[3], $array[4], $array[6]); #the array will be used as values for my hash later
        }
        if ($array[2] =~/CDS/){
            push (@BMB, $array[2]); #put more data in my array
        }
        elsif ($array[2] =~/exon/){
            push (@BMB, $array[2]);
        }
        if ($array[8] =~ /.*;db_xref=GeneID:(\d+)\n/) { #if you find the word 'GeneID' extract the following number and put it in my hash (as key), then put the array in my hash
            $GeneID = $1;
            @{$hash {$GeneID}} = @BMB;
        }
        if ($array [8]=~ /.*;exon_number=1/){ #if you find the word 'exon number', then the textblock is over
            $flag = 0;
        }
    }
    close $in;

    while ( ($GeneID, $BMB) = each %hash) {
        print "$GeneID => $BMB[0]\n";
    }

Okay, the script works, but I have more than one textblock to work with. Each time through the loop, the new data that are put in my array overwrite the data from the previous pass, so in the end I only have the information of the last textblock in my array. My supervisor told me what she wants me to do: I should think about something like a 'flag' which recognizes that a new textblock has started, one that contains new data I want to store in the array. Interestingly, I don't have a problem with the keys: every key shows up in my output (but each key has the same values). The problem is that I don't know what such a 'flag' could look like, so I don't know what to search for in the literature etc. I hope you understand my problem; my English isn't the best. Does anybody have an idea that could help? Best wishes

Re: problems with flip flop
by i5513 (Pilgrim) on Aug 17, 2011 at 08:27 UTC
    Hi,

    Use push so that you add elements to your array instead of overwriting it.

    Flags usually are used like this example:

    while (...) {
        if (/xxyy/)   # end of your block searched
        {
            $flag = 0;
        }
        if (/.../) {
            $flag = 1;
        }
        ...
        ...
        if ($flag == 1) {
            ...
        }
    }

    I didn't understand your question completely, but I hope that helps

    Regards,

      Hey. I guess $flag==1 is what I'm searching for. But where in my script do I have to write it?
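One possible placement of the flag, as a minimal sketch rather than a finished solution: the flag is set when a 'gene' line opens a block and cleared once the following CDS/exon line has been stored. The sub name `parse_gff_lines`, the variable names, and the choice to read from a list of lines instead of the file are assumptions made for illustration; the column positions follow the tab-separated layout described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: takes GFF lines and returns a hash reference
# of the form { GeneID => [start, end, strand, type] }.
sub parse_gff_lines {
    my @lines = @_;
    my %genes;
    my $in_block = 0;    # the flag: 1 while we are inside a gene block
    my ($gene_id, $record);

    for my $line (@lines) {
        chomp $line;
        my @f = split /\t/, $line;
        next if @f < 9;    # skip lines that do not have all nine columns

        if ($f[2] eq 'gene' and $f[8] =~ /db_xref=GeneID:(\d+)/) {
            # A new block starts: remember its ID and build a fresh record.
            $gene_id  = $1;
            $record   = [ @f[3, 4, 6] ];
            $in_block = 1;
        }
        elsif ($in_block and $f[2] =~ /^(?:CDS|exon)$/) {
            # Second line of the block: add the feature type and store it.
            push @$record, $f[2];
            $genes{$gene_id} = $record;
            $in_block = 0;    # block finished, wait for the next 'gene' line
        }
    }
    return \%genes;
}

# Sample lines rebuilt from the data in the question, joined with real tabs.
my @sample = (
    join("\t", qw(NC_014171.1 RefSeq gene 14311 14425 . + .),
        'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898'),
    join("\t", qw(NC_014171.1 RefSeq exon 14311 14425 . + .),
        'gbkey=rRNA;db_xref=GeneID:9190898;exon_number=1'),
    join("\t", qw(NC_014171.1 RefSeq gene 14459 15460 . - .),
        'locus_tag=BMB171_C0007;db_xref=GeneID:9190899'),
    join("\t", qw(NC_014171.1 RefSeq CDS 14462 15460 . - 0),
        'locus_tag=BMB171_C0007;db_xref=GeneID:9190899;exon_number=1'),
);

my $genes = parse_gff_lines(@sample);
for my $id (sort keys %$genes) {
    print "$id => @{ $genes->{$id} }\n";
}
```

Because a fresh array reference is created for every 'gene' line, each key ends up with its own record and later blocks cannot overwrite earlier ones.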

Re: problems with flip flop
by Neighbour (Friar) on Aug 17, 2011 at 10:30 UTC
    Let's go back to the simple version.
    There is one 'textblock' of data.
    Now, what is it you want to end up with?

    PS. The tabs are apparently not present anymore when copying the example data you posted. This doesn't help much when trying to reproduce what your program does :)

      Okay, here you see the elements of the array (@array) for one textblock (after using split). Now it should be easier to understand the rest of the script.

      [0] NC_014171.1
      [1] RefSeq
      [2] gene
      [3] 14311
      [4] 14425
      [5] .
      [6] +
      [7] .
      [8] ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898

      new line but same 'textblock' of data

      [0] NC_014171.1
      [1] RefSeq
      [2] exon (or CDS)
      [3] 14311
      [4] 14425
      [5] .
      [6] +
      [7] .
      [8] ID=NC_014171.1:rrs_1:unknown_transcript_1;Parent=NC_014171.1:rrs_1;gbkey=rRNA;locus_tag=BMB171_C5091;product=5S ribosomal RNA;db_xref=GeneID:9190898;exon_number=1

      What I want is to create a hash. I want to extract the number after the word 'GeneID' (line one of the textblock, at the end of element 8), and this number should be the key of my hash. The values for my hash should first be stored in an array. I need the following information as values for my array:

      line one of textblock: $array[3], which is a number; $array[4], again a number; $array[6], which is + or -
      line two of textblock: $array[2], which can be the word 'CDS' or 'exon'

      So far the script is working. Problems arise when the next text block is processed.
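The field layout described here can be checked on a single line with a small sketch. The helper name `gene_id_of` is hypothetical, and the sample line is rebuilt from the posted data with real tabs. Note that the pattern has no trailing `\n`, so it also matches when further attributes follow the GeneID, as they do on the exon/CDS lines.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the number after 'GeneID:' out of column 9.
# Returns undef when the attribute string carries no GeneID.
sub gene_id_of {
    my ($attributes) = @_;
    return $attributes =~ /\bdb_xref=GeneID:(\d+)/ ? $1 : undef;
}

# First line of a textblock, joined with tabs as in the .gff file.
my $line = join "\t", qw(NC_014171.1 RefSeq gene 14311 14425 . + .),
    'ID=NC_014171.1:rrs_1;locus_tag=BMB171_C5091;db_xref=GeneID:9190898';

my @array = split /\t/, $line;
my ($start, $end, $strand) = @array[3, 4, 6];    # the three values wanted from line one
my $gene_id = gene_id_of($array[8]);             # the future hash key

print "GeneID=$gene_id start=$start end=$end strand=$strand\n";
```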

        I apologise for rewriting a major portion of your code. I usually try to change as little as possible, but somehow that didn't work here :)
        The major approach change is to use the readline-function to read data from the textfile as needed. It seemed like whenever you found a 'gene'-line, you would need to read the next line for 'CDS' or 'exon'-data. You could do this with a flag (as you initially suggested), but why not do simply what you need to do...read the next line immediately?
        The other thing I changed was that the data is now stored in a hash reference (instead of a hash). This is not a requirement per se, but Data::Dump prints hashrefs in an easier to understand way than hashes.
        Also, I replaced the data-entries @BMB with $ar_record. It is easier to store lots of records as references instead of arrays.
        Lastly, I removed a lot of variable declarations from the start of the program and put them where they are needed/filled. There is no need to fear a negative performance impact due to initializing variables within a loop. Perl handles this just fine. This will also help you to keep the data in scope (so your main program won't know the 'temporary' variables that were used inside the loop). (I'm not sure I'm explaining this well...)
        #!/usr/bin/perl
        # Task: Extract GeneID-Number and gene information
        use strict;
        use warnings;
        use Data::Dump;

        my $in;
        my $hr_data;

        # 1) open the .gff Inputfile and while reading line by line split $data at each tab and put them in the @array
        open ($in, '<', "Genomteil.gff") or die $!;

        while (my $line1 = readline ($in)) {
            chomp ($line1); # Removes trailing \n
            my @a_line1 = split ("\t", $line1);
            if ($a_line1[2] eq 'gene') {
                if ($a_line1[8] =~ /.*;db_xref=GeneID:(\d+)/) {
                    my $GeneID = $1; # declared with 'my' so it compiles under 'use strict'
                    # We found a GeneID. Create a record (array-reference) to store with the data from this line
                    my $ar_record = [$a_line1[3], $a_line1[4], $a_line1[6]]; #the array will be used as values for my hash later
                    # Also, read the next line from file, which we expect to contain CDS or exon
                    my $line2 = readline ($in);
                    chomp ($line2);
                    my @a_line2 = split ("\t", $line2);
                    if ($a_line2[2] =~ /CDS|exon/) { # Alternatively: ($a_line2[2] eq 'CDS' or $a_line2[2] eq 'exon')
                        push (@{$ar_record}, $a_line2[2]);
                        $hr_data->{$GeneID} = $ar_record;
                    } else {
                        print ("Error: next line does not contain CDS or exon [$.]\n");
                        next;
                    }
                } else {
                    print ("Error: 'gene' textblock found, but no GeneID present at line [$.]\n");
                    next;
                }
            } ## end if ($a_line1[2] eq 'gene')
        } ## end while (my $line1 = readline...)
        close $in;

        Data::Dump::dd($hr_data);
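A side note on the symptom from the question (every key printing the same value): in the original print loop, `$BMB[0]` is element 0 of the array `@BMB`, not of the array reference that `each` assigns to the scalar `$BMB`. That alone would make every line show the value left over from the last block. The following is a minimal sketch of the difference, using hard-coded records in place of the parsed file; the arrays `@wrong` and `@right` exist only for illustration.

```perl
use strict;
use warnings;

# Two stored records, as the parsing loop would leave them.
my %hash = (
    9190898 => [ 14311, 14425, '+' ],
    9190899 => [ 14459, 15460, '-' ],
);
my @BMB = (14459, 15460, '-');    # leftover contents from the last block

my (@wrong, @right);
for my $GeneID (sort keys %hash) {
    my $BMB = $hash{$GeneID};               # array reference stored for this key
    push @wrong, "$GeneID => $BMB[0]";      # reads @BMB: same value every time
    push @right, "$GeneID => $BMB->[0]";    # dereferences the stored record
}
print "$_\n" for @wrong, @right;
```

With `$BMB->[0]` (or `@{ $hash{$GeneID} }` for the whole record) each key prints its own data.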