comment on

Your code will only print a line if your getIDnumber sub returns a true value. Are you sure that the last lines of your input start with a > character (with nothing intervening)? Yes, we're quite sure. The files we are working with are DNA sequence data from the Unigene collection at NCBI. The actual files can be found in here. The last lines of the file look like this: >gnl|UG|Ta#S75229 Triticum aestivum putative CRT/DRE-binding factor (CBF1) mRNA, complete cds /cds=(18,656) /gb=AF376136 /gi=17226800 /ug=Ta.2781 /len=800 TTTTTGACGCTGCAACTGATGGACACCGCCGCTGCCGGCTCCCCGCGTGAGGGGCACAGG ACGGTGTGCTCGGAGCCGCCCAAGAGGCCGGCAGGGCGGACCAAGTTCAGGGAGACGCGC CACCCGCTGTACCGCGGCGTGCGGCGCCGGGGCCGGCTCGGGCAGTGGGTGTGCGAGGTT CGCGTGCGCGGCGCGCAAGGGTACAGGCTCTGGCTCGGCACCTTCACCACTGCCGAGATG GCGGCGCGCGCGCACGACTCCGCCGTGCTCGCGCTCCTCGACCGCGCCGCCTGCCTCAAC TTCGCCGACTCCGCCTGGCGGATGCTGCCCGTCCTCGCGGCTGGCTCGTCCCGCTTCAGC AGCGCGCGGGAGATCAAGGACGCCGTCGCCATCGCCGTCCTGGAGTTCCAGCGGCAGCGC CCCGTCGTGTCAACGTCGGAGATGCACGACGGCGAAAAGGACGCCCAAGGCTCGCCGACG CCGAGCGAGCTGTCCACGTCCAGCGACTTGTTGGACGAGCACTGGTTTGGCGGCATGGAC GCCGGCTCGTACTACGCGAGCTTGGCGCAGGGGATGCTCATGGAGCCGCCGTCCGCCAGA ACGTGGAGCGAGGATGGCGGCGAATACAGCGCCGTCTACACGCCGCTTTGGAACTAATTA TCCGACTAATTAAGCCATGTACAGTTTTGGAAACTACTCCCTCGGTAAACTAATATAAGA GCATTTAAATCATTAAAATAGTGATCTAAACACTCTTATATTAAGTTTACGGAGGGAGTA GGCTACTAGTGGTTGTGTTG

We're intersted in the lines beginning with a greater than sign, and we want to collect the information /gb=AF376136, gi=17226800 or both. We are not too concerned about the last couple of lines as they do not have description lines starting with the greater than sign.

What we do see when we ran the earlier script against the Ta.seq.uniq file, is that all the GI numbers were collected correctly and printed to Standard Out. Unfortunately the last 200 or so GI numbers are not printed to the Output filehandle. Looking at the code, there is nothing strange or unusual going on about the opening or closing of the file handles - we do this quite a lot. Instead, I do think that problem is the large numbers of GI and GB numbers being printed to the Standard Out buffer.

Thanks you for noticing the $ID_LIMIT variable. The newest piece of code does not use this one. I'm enclosing the code here:

#!/usr/bin/perl -w

package main;

require 5.004;

use strict;
use Getopt::Std;

########## CONSTANTS  ###############

#my $ID_LIMIT = 10000 ;
my $BUFFER_LIST = 2000 ; # WRITE BUFFER Size of array accumulated befo
+re writing to file.


########### VARIABLES ####################
use vars qw($opt_h $opt_v $opt_d $opt_i $opt_o $opt_g $opt_p) ;
use vars qw($inputFile $outputFile $id $sequenceType) ;
use vars qw($number $cntr_idNumber @outputBuffer) ;

use vars qw($numRecords) ;

getopts('i:o:g:p:hvd') ;

$opt_h and die <<"QQ_HELP_QQ" ;

DESCRIPTION

     This program reads a fasta 'Genbank formatted file', retrieves
     the gb numbers by default, or gi numbers if specified
     and writes the numbers to a text file

AVAILABILITY

     Requires perl 5.004 or higher

SYNOPSIS

     makeGIList.pl -h -i [path/fileToRead] -o [path/fileToWrite]

OPTIONS

    -h                          Print help message
    -v                          Verbose mode
    -i      [path/fileToRead]      File to process
    -o      [path/fileToWrite]    Destination
    -g    [gi|gb|both]          Make file with GI or Accession number 
+or both
    -p      [n|p]            nucleotide or protein, DEFAULT = 'n'    


EXIT STATUS

     0        Fail
     1        Successful completion


     
QQ_HELP_QQ

#undocumented in help ...
# DEBUG MODE
if ($opt_d) {
    $opt_v = 'v' ;
    $opt_i = "D:\\datasets\\ncbi\\UniGene\\Ta.seq.uniq" ;
    $opt_o = "D:\\datasets\\ncbi\\UniGene\\gbnumTest.id" ;
}


# Input path parameter
if ($opt_i) {
    $inputFile = $opt_i ;
} else {
    die "an input file with its path is required" ;
}


# output path and filename
if ($opt_o) {
    $outputFile = $opt_o ;
} else {  
    die "an ouput file with its path is required" ;
}


# id to fetch, GI, Accession, or both
if ($opt_g) {
    if (($opt_g eq 'gi') or ($opt_g eq 'GI')) {
        $id = 'gi' ;
    } elsif (($opt_g eq 'gb') or ($opt_g eq 'GB')) {
        $id = 'gi' ;
    } elsif (($opt_g eq 'both') or ($opt_g eq 'BOTH')) {
        $id = 'both' ;
    }
} else {
    $opt_g = 'gb' ;
    $id    = $opt_g ;
}


# type of file
if ($opt_p) {
    if (($opt_p eq 'n') or ($opt_p eq 'N')) {
        $sequenceType = 'n';
    } elsif (($opt_p eq 'p') or ($opt_p eq 'P')) {
        $sequenceType = 'p' ;
    }
} else {
    $opt_p = 'n' ;          #default 'n' nucleotide
    $sequenceType = $opt_p ;
}
    
# Debug mode
# print out variables mapped from input flags 
if ($opt_d) {
    print "\$opt_d: $opt_d  DEBUG mode on" ;      # undocumented 
    print "\$opt_h: $opt_h\n" ;
    print "\$opt_v: $opt_v\n" ;
    print "\$opt_i: $opt_i ==> \$inputFile:    $inputFile\n" ;
    print "\$opt_o: $opt_o ==> \$outputFile:   $outputFile\n" ;
    print "\$opt_g: $opt_g ==> \$id:           $id\n" ;
    print "\$opt_p: $opt_p\==> \$sequenceType: $sequenceType\n" ;
}
    
####################  END OF INPUT FLAGS ########################

#################################################################
####################  Main Program ##############################

#open source file
if (open (INPUT, "<$inputFile")) {
    $opt_v and print "opened file $inputFile\n" ;
} else {
    die "Can not open $inputFile: $!.\n" ;
}

#open destination file
if (open (OUTPUT, ">>$outputFile")) {
    $opt_v and print "opened file to write:  $outputFile.\n" ;
} else {
   die "Can not write to $outputFile: $!.\n" ;
}


#print first line of destination file
#specify whether accession numbers are for 'n' nucleotides or  'p' pro
+teins.
if ($sequenceType eq 'n') {
    print OUTPUT ">nucleotide\n" ;
    close OUTPUT or die "can not close $outputFile $!\n" ;
    $opt_v and print "closed file to write after writing first line\n\
+n"
} elsif ($sequenceType eq 'p') {
    print OUTPUT ">protein\n" ;
    close OUTPUT or die "can not close $outputFile $!\n" ;
    $opt_v and print "closed file to write after writing first line\n\
+n"
}



$opt_v and print "finding $id numbers\n" ;              #verbose mode
$cntr_idNumber = 0 ;                                    #count records
+ whiles we find the id numbers
@outputBuffer = () ;                                    #declare outpu
+t buffer
while (my $line = <INPUT> ) {                           #read the inpu
+t file line by line
    chomp ($line) ;
    if ($number = &getIDnumber($line,$id)) {
        $cntr_idNumber += 1 ;                           #increment id 
+counter
        $opt_v and print "$cntr_idNumber) $number\n" ;
        push @outputBuffer, $number ;                   # add id to bu
+ffer
        if (scalar @outputBuffer == $BUFFER_LIST) {     # buffered out
+put, print evevery $BUFFER_LIST lines
            if (open (OUTPUT, ">>$outputFile")) {
                $opt_v and print "opened file to write $BUFFER_LIST id
+s:  $outputFile.\n" ;
            } else {
                    die "Can not open $inputFile: $!.\n" ;
            }
            foreach my $item (@outputBuffer) { 
                    print  OUTPUT "$item\n" ;    
            }
            @outputBuffer = ();     #reset buffer
            close OUTPUT or die "can not close $outputFile after writi
+ng $BUFFER_LIST ids $!\n" ;
            $opt_v and print "Closed $outputFile after writing $BUFFER
+_LIST ids\n\n" ;
        }
    } 
}

# print remainder of ids
# Flush rest of the buffer
if (open (OUTPUT, ">>$outputFile")) {
    $opt_v and print "opened file to write remainder of ids:  $outputF
+ile.\n" ;
} else {
    die "Can not open $inputFile: $!.\n" ;
}
foreach my $item (@outputBuffer) {  #print out remaining id's from buf
+fer
    print OUTPUT "$item\n" ;
}
close OUTPUT ;          # close filehandle
@outputBuffer = () ;    #clear buffer 

$opt_v and print "Closed $outputFile after writing remainder of ids\n\
+n" ;
$opt_v and print "$cntr_idNumber records\n" ;
$opt_v and print "done\n" ;

exit;




####################  END of Main Program  ####################### 
################################################################## 


sub getIDnumber {
    my($line, $id) = @_ ;
    my $IDtoReturn = '' ;
    my $IDfromDef = '' ;
    if ($line =~ /^>/) {                # fasta Definition Line
        $opt_d and print "\$line: $line\n"; 
        if (($id eq 'gi') and ($line =~ /\/gi=.*?\s/)) {
            $IDtoReturn= $& ;
            chomp($IDtoReturn) ;
        } elsif (($id eq 'gb') and ($line =~ /\/gb=.*?\s/)) {
            $opt_d and print "$&:  " ;
            $IDfromDef = $&;
            $IDtoReturn = substr($IDfromDef, 4) ;
            chomp($IDtoReturn) ;
        }            
        
    } elsif ($line =~ /^$/){        # Blank line
        # do nothing
    } else {                            #sequence 
        # do nothing
    }
    return $IDtoReturn
}
[download]

The main difference this time is that we are writing output as chunks - $BUFFER_LIST variable. This does the job quite nicely. We will implement the buffer flushing $| = 1; technique mentioned in the earlier emails.

MadraghRua
yet another biologist hacking perl....

In reply to Re: Re: Buffering File Output by MadraghRua
in thread Buffering File Output by MadraghRua

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.