MadraghRua has asked for the wisdom of the Perl Monks concerning the following question:

Hi Folks

We have a script to write output to a filehandle. The file handle will be accepting maybe 100,000 + lines of output. We keep finding that although the script appears to correctly grab and extract the data we are intersted in, it does not write all this data to the output file - instead we are always missing the final couple of hundred lines of data. We have gotten around the problem by opening the output file handle, writing 1000 lines of data and then closing the output file handle, repeating this each time we need to to go through the entire data set. This does seem a tad kludgey though.

#!/usr/bin/perl -w package main; require 5.004; use strict; use Getopt::Std; ########## CONSTANTS ############### my $ID_LIMIT = 10000 ; use vars qw($opt_h $opt_v $opt_d $opt_i $opt_o $opt_g $opt_p) ; use vars qw($inputFile $outputFile $id $sequenceType) ; use vars qw($key) ; use vars qw($numRecords) ; getopts('i:o:g:p:hvd') ; $opt_h and die <<"QQ_HELP_QQ" ; DESCRIPTION This program reads a fasta 'Genbank formatted file', retrieves th +e gi numbers, and writes the numbers to a text file AVAILABILITY Requires perl 5.004 or higher SYNOPSIS makeGIList.pl -h -i [path/fileToRead] -o [path/fileToWrite] OPTIONS -h Print help message -v Verbose mode -i [path/fileToRead] File to process -o [path/fileToWrite] Destination -g [gi|gb|both] Make file with GI or Accession number +or both -p [n|p] nucleotide or protein, DEFAULT = 'n' EXIT STATUS 0 Fail 1 Successful completion QQ_HELP_QQ if ($opt_d) { $opt_v = 'v' ; $opt_i = "D:\\datasets\\ncbi\\UniGene\\Ta.seq.uniq" ; $opt_o = "D:\\datasets\\ncbi\\UniGene\\gbnumTest.id" ; } # Input path parameter if ($opt_i) { $inputFile = $opt_i ; } else { die "an input file with its path is required" ; } # output path and filename if ($opt_o) { $outputFile = $opt_o ; } else { die "an ouput file with its path is required" ; } # id to fetch, GI, Accession, or both if ($opt_g) { if (($opt_g eq 'gi') or ($opt_g eq 'GI')) { $id = 'gi' ; } elsif (($opt_g eq 'gb') or ($opt_g eq 'GB')) { $id = 'gi' ; } elsif (($opt_g eq 'both') or ($opt_g eq 'BOTH')) { $id = 'both' ; } } else { $opt_g = 'gb' ; $id = $opt_g ; } # type of file if ($opt_p) { if (($opt_p eq 'n') or ($opt_p eq 'N')) { $sequenceType = 'n'; } elsif (($opt_p eq 'p') or ($opt_p eq 'P')) { $sequenceType = 'p' ; } } else { $opt_p = 'p' ; $sequenceType = $opt_p ; } # Debug mode # print out variables mapped from input flags if ($opt_d) { print "\$opt_d: $opt_d DEBUG mode on" ; # undocumented print "\$opt_h: $opt_h\n" ; print "\$opt_v: $opt_v\n" ; print "\$opt_i: $opt_i ==> \$inputFile: $inputFile\n" ; print "\$opt_o: $opt_o ==> \$outputFile: $outputFile\n" ; print "\$opt_g: $opt_g ==> \$id: $id\n" ; print "\$opt_p: $opt_p\==> \$sequenceType: $sequenceType\n" ; } #################### END OF INPUT FLAGS ######################## ################################################################# #################### Main Program ############################## #open source file if (open (INPUT, "<$inputFile")) { $opt_v and print "opened file $inputFile\n" ; } else { die "Can not open $inputFile: $!.\n" ; } #open destination file if (open (OUTPUT, ">>$outputFile")) { $opt_v and print "opened file to write: $outputFile.\n" ; } else { die "Can not write to $outputFile: $!.\n" ; } $opt_v and print "finding $id numbers\n" ; while (my $line = <INPUT> ) { chomp ($line) ; if ($key = &getIDnumber($line,$id)) { print OUTPUT "$key\n" ; } ; } $opt_v and print "done\n" ; exit; #################### END of Main Program ####################### ################################################################## sub getIDnumber { my($line, $id) = @_ ; my $IDtoReturn = '' ; my $IDfromDef = '' ; if ($line =~ /^>/) { # fasta Definition Line if (($id eq 'gi') and ($line =~ /\/gi=.*?\s/)) { $IDtoReturn= $1 ; chomp($IDtoReturn) ; } elsif (($id eq 'gb') and ($line =~ /\/gb=.*?\s/)) { $IDfromDef = $&; $IDtoReturn = substr($IDfromDef, 4) ; chomp($IDtoReturn) ; } } elsif ($line =~ /^$/){ # Blank line # do nothing } else { #sequence # do nothing } return $IDtoReturn }

Suggestions on how we could do this without opening and closing the filehandle would be most welcome.

Thanks in advance for your replies.

MadraghRua
yet another biologist hacking perl....

Replies are listed 'Best First'.
Re: Buffering File Output
by clintp (Curate) on Jan 23, 2002 at 04:56 UTC
    Do something like this once, after opening the output filehandle:
    select(OUTPUT); # Change the default filehandle $|=1; # Turn on autoflushing for default f.h. select(STDOUT); # Go back to normal default f.h.
    More importantly:

    When processes exit normally (via the exit function or simply falling off the end of the script) pending output on filehandles is flushed and the filehandles are closed. What's curious is that this isn't happening for you. Are you terminating the script in some odd way? Segfaulting it? Killing it with a signal?

      Thanks for the reply. No, nothing is used to terminate the script. What finishes the script is the end of file marker in the input file. So when the script finishes running, all the records are printed to screen and about 200 lines are not printed to the output file.

      MadraghRua
      yet another biologist hacking perl....

Re: Buffering File Output
by VSarkiss (Monsignor) on Jan 23, 2002 at 05:40 UTC

    From your other response, I suspect the problem isn't related to buffering, contrary to what you've titled your node.

    Your code will only print a line if your getIDnumber sub returns a true value. Are you sure that the last lines of your input start with a > character (with nothing intervening)? Otherwise the sub will return an empty string, which represents falsehood in Perl, which will cause your program to not print anything.

    You said it works when you open and close the filehandle, but how are you doing that? Is the variable $ID_LIMIT involved? (You should be getting a "used only once" warning on that one. ;-) My suspicion is that it's something incidental to how you're doing the close and re-open. Maybe if you posted that other code, the bug would be easier to spot.

      Your code will only print a line if your getIDnumber sub returns a true value. Are you sure that the last lines of your input start with a > character (with nothing intervening)? Yes, we're quite sure. The files we are working with are DNA sequence data from the Unigene collection at NCBI. The actual files can be found in here. The last lines of the file look like this: >gnl|UG|Ta#S75229 Triticum aestivum putative CRT/DRE-binding factor (CBF1) mRNA, complete cds /cds=(18,656) /gb=AF376136 /gi=17226800 /ug=Ta.2781 /len=800 TTTTTGACGCTGCAACTGATGGACACCGCCGCTGCCGGCTCCCCGCGTGAGGGGCACAGG ACGGTGTGCTCGGAGCCGCCCAAGAGGCCGGCAGGGCGGACCAAGTTCAGGGAGACGCGC CACCCGCTGTACCGCGGCGTGCGGCGCCGGGGCCGGCTCGGGCAGTGGGTGTGCGAGGTT CGCGTGCGCGGCGCGCAAGGGTACAGGCTCTGGCTCGGCACCTTCACCACTGCCGAGATG GCGGCGCGCGCGCACGACTCCGCCGTGCTCGCGCTCCTCGACCGCGCCGCCTGCCTCAAC TTCGCCGACTCCGCCTGGCGGATGCTGCCCGTCCTCGCGGCTGGCTCGTCCCGCTTCAGC AGCGCGCGGGAGATCAAGGACGCCGTCGCCATCGCCGTCCTGGAGTTCCAGCGGCAGCGC CCCGTCGTGTCAACGTCGGAGATGCACGACGGCGAAAAGGACGCCCAAGGCTCGCCGACG CCGAGCGAGCTGTCCACGTCCAGCGACTTGTTGGACGAGCACTGGTTTGGCGGCATGGAC GCCGGCTCGTACTACGCGAGCTTGGCGCAGGGGATGCTCATGGAGCCGCCGTCCGCCAGA ACGTGGAGCGAGGATGGCGGCGAATACAGCGCCGTCTACACGCCGCTTTGGAACTAATTA TCCGACTAATTAAGCCATGTACAGTTTTGGAAACTACTCCCTCGGTAAACTAATATAAGA GCATTTAAATCATTAAAATAGTGATCTAAACACTCTTATATTAAGTTTACGGAGGGAGTA GGCTACTAGTGGTTGTGTTG

      We're intersted in the lines beginning with a greater than sign, and we want to collect the information /gb=AF376136, gi=17226800 or both. We are not too concerned about the last couple of lines as they do not have description lines starting with the greater than sign.

      What we do see when we ran the earlier script against the Ta.seq.uniq file, is that all the GI numbers were collected correctly and printed to Standard Out. Unfortunately the last 200 or so GI numbers are not printed to the Output filehandle. Looking at the code, there is nothing strange or unusual going on about the opening or closing of the file handles - we do this quite a lot. Instead, I do think that problem is the large numbers of GI and GB numbers being printed to the Standard Out buffer.

      Thanks you for noticing the $ID_LIMIT variable. The newest piece of code does not use this one. I'm enclosing the code here:

      #!/usr/bin/perl -w package main; require 5.004; use strict; use Getopt::Std; ########## CONSTANTS ############### #my $ID_LIMIT = 10000 ; my $BUFFER_LIST = 2000 ; # WRITE BUFFER Size of array accumulated befo +re writing to file. ########### VARIABLES #################### use vars qw($opt_h $opt_v $opt_d $opt_i $opt_o $opt_g $opt_p) ; use vars qw($inputFile $outputFile $id $sequenceType) ; use vars qw($number $cntr_idNumber @outputBuffer) ; use vars qw($numRecords) ; getopts('i:o:g:p:hvd') ; $opt_h and die <<"QQ_HELP_QQ" ; DESCRIPTION This program reads a fasta 'Genbank formatted file', retrieves the gb numbers by default, or gi numbers if specified and writes the numbers to a text file AVAILABILITY Requires perl 5.004 or higher SYNOPSIS makeGIList.pl -h -i [path/fileToRead] -o [path/fileToWrite] OPTIONS -h Print help message -v Verbose mode -i [path/fileToRead] File to process -o [path/fileToWrite] Destination -g [gi|gb|both] Make file with GI or Accession number +or both -p [n|p] nucleotide or protein, DEFAULT = 'n' EXIT STATUS 0 Fail 1 Successful completion QQ_HELP_QQ #undocumented in help ... # DEBUG MODE if ($opt_d) { $opt_v = 'v' ; $opt_i = "D:\\datasets\\ncbi\\UniGene\\Ta.seq.uniq" ; $opt_o = "D:\\datasets\\ncbi\\UniGene\\gbnumTest.id" ; } # Input path parameter if ($opt_i) { $inputFile = $opt_i ; } else { die "an input file with its path is required" ; } # output path and filename if ($opt_o) { $outputFile = $opt_o ; } else { die "an ouput file with its path is required" ; } # id to fetch, GI, Accession, or both if ($opt_g) { if (($opt_g eq 'gi') or ($opt_g eq 'GI')) { $id = 'gi' ; } elsif (($opt_g eq 'gb') or ($opt_g eq 'GB')) { $id = 'gi' ; } elsif (($opt_g eq 'both') or ($opt_g eq 'BOTH')) { $id = 'both' ; } } else { $opt_g = 'gb' ; $id = $opt_g ; } # type of file if ($opt_p) { if (($opt_p eq 'n') or ($opt_p eq 'N')) { $sequenceType = 'n'; } elsif (($opt_p eq 'p') or ($opt_p eq 'P')) { $sequenceType = 'p' ; } } else { $opt_p = 'n' ; #default 'n' nucleotide $sequenceType = $opt_p ; } # Debug mode # print out variables mapped from input flags if ($opt_d) { print "\$opt_d: $opt_d DEBUG mode on" ; # undocumented print "\$opt_h: $opt_h\n" ; print "\$opt_v: $opt_v\n" ; print "\$opt_i: $opt_i ==> \$inputFile: $inputFile\n" ; print "\$opt_o: $opt_o ==> \$outputFile: $outputFile\n" ; print "\$opt_g: $opt_g ==> \$id: $id\n" ; print "\$opt_p: $opt_p\==> \$sequenceType: $sequenceType\n" ; } #################### END OF INPUT FLAGS ######################## ################################################################# #################### Main Program ############################## #open source file if (open (INPUT, "<$inputFile")) { $opt_v and print "opened file $inputFile\n" ; } else { die "Can not open $inputFile: $!.\n" ; } #open destination file if (open (OUTPUT, ">>$outputFile")) { $opt_v and print "opened file to write: $outputFile.\n" ; } else { die "Can not write to $outputFile: $!.\n" ; } #print first line of destination file #specify whether accession numbers are for 'n' nucleotides or 'p' pro +teins. if ($sequenceType eq 'n') { print OUTPUT ">nucleotide\n" ; close OUTPUT or die "can not close $outputFile $!\n" ; $opt_v and print "closed file to write after writing first line\n\ +n" } elsif ($sequenceType eq 'p') { print OUTPUT ">protein\n" ; close OUTPUT or die "can not close $outputFile $!\n" ; $opt_v and print "closed file to write after writing first line\n\ +n" } $opt_v and print "finding $id numbers\n" ; #verbose mode $cntr_idNumber = 0 ; #count records + whiles we find the id numbers @outputBuffer = () ; #declare outpu +t buffer while (my $line = <INPUT> ) { #read the inpu +t file line by line chomp ($line) ; if ($number = &getIDnumber($line,$id)) { $cntr_idNumber += 1 ; #increment id +counter $opt_v and print "$cntr_idNumber) $number\n" ; push @outputBuffer, $number ; # add id to bu +ffer if (scalar @outputBuffer == $BUFFER_LIST) { # buffered out +put, print evevery $BUFFER_LIST lines if (open (OUTPUT, ">>$outputFile")) { $opt_v and print "opened file to write $BUFFER_LIST id +s: $outputFile.\n" ; } else { die "Can not open $inputFile: $!.\n" ; } foreach my $item (@outputBuffer) { print OUTPUT "$item\n" ; } @outputBuffer = (); #reset buffer close OUTPUT or die "can not close $outputFile after writi +ng $BUFFER_LIST ids $!\n" ; $opt_v and print "Closed $outputFile after writing $BUFFER +_LIST ids\n\n" ; } } } # print remainder of ids # Flush rest of the buffer if (open (OUTPUT, ">>$outputFile")) { $opt_v and print "opened file to write remainder of ids: $outputF +ile.\n" ; } else { die "Can not open $inputFile: $!.\n" ; } foreach my $item (@outputBuffer) { #print out remaining id's from buf +fer print OUTPUT "$item\n" ; } close OUTPUT ; # close filehandle @outputBuffer = () ; #clear buffer $opt_v and print "Closed $outputFile after writing remainder of ids\n\ +n" ; $opt_v and print "$cntr_idNumber records\n" ; $opt_v and print "done\n" ; exit; #################### END of Main Program ####################### ################################################################## sub getIDnumber { my($line, $id) = @_ ; my $IDtoReturn = '' ; my $IDfromDef = '' ; if ($line =~ /^>/) { # fasta Definition Line $opt_d and print "\$line: $line\n"; if (($id eq 'gi') and ($line =~ /\/gi=.*?\s/)) { $IDtoReturn= $& ; chomp($IDtoReturn) ; } elsif (($id eq 'gb') and ($line =~ /\/gb=.*?\s/)) { $opt_d and print "$&: " ; $IDfromDef = $&; $IDtoReturn = substr($IDfromDef, 4) ; chomp($IDtoReturn) ; } } elsif ($line =~ /^$/){ # Blank line # do nothing } else { #sequence # do nothing } return $IDtoReturn }
      The main difference this time is that we are writing output as chunks - $BUFFER_LIST variable. This does the job quite nicely. We will implement the buffer flushing $| = 1; technique mentioned in the earlier emails.

      MadraghRua
      yet another biologist hacking perl....

(Ovid) Re: Buffering File Output
by Ovid (Cardinal) on Jan 23, 2002 at 05:03 UTC
Re: Buffering File Output
by jsegal (Friar) on Jan 23, 2002 at 22:38 UTC
    This doesn't directly answer your question (that has been addressed already), but you should check out the BioPerl package, also available at bioperl.org. Among other things in this package, there are modules for parsing sequences in many formats (including FASTA) -- using these, your code will never need to search explicity for ">" to begin a line, and you will be able to think more about answering your questions and less about parsing the files. As a byproduct, your code will be much simpler to read and maintain, and problems (like this one) will be easier to track down and debug.

    Good luck!