comment on

Hi Folks

We have a script to write output to a filehandle. The file handle will be accepting maybe 100,000 + lines of output. We keep finding that although the script appears to correctly grab and extract the data we are intersted in, it does not write all this data to the output file - instead we are always missing the final couple of hundred lines of data. We have gotten around the problem by opening the output file handle, writing 1000 lines of data and then closing the output file handle, repeating this each time we need to to go through the entire data set. This does seem a tad kludgey though.

#!/usr/bin/perl -w

package main;

require 5.004;

use strict;
use Getopt::Std;

########## CONSTANTS  ###############

my $ID_LIMIT = 10000 ;

use vars qw($opt_h $opt_v $opt_d $opt_i $opt_o $opt_g $opt_p) ;
use vars qw($inputFile $outputFile $id $sequenceType) ;
use vars qw($key) ;

use vars qw($numRecords) ;

getopts('i:o:g:p:hvd') ;

$opt_h and die <<"QQ_HELP_QQ" ;

DESCRIPTION

     This program reads a fasta 'Genbank formatted file', retrieves th
+e gi
numbers,
     and writes the numbers to a text file

AVAILABILITY

     Requires perl 5.004 or higher

SYNOPSIS

     makeGIList.pl -h -i [path/fileToRead] -o [path/fileToWrite]

OPTIONS

    -h                          Print help message
    -v                          Verbose mode
    -i      [path/fileToRead]      File to process
    -o      [path/fileToWrite]    Destination
    -g    [gi|gb|both]          Make file with GI or Accession number 
+or
both
    -p      [n|p]            nucleotide or protein, DEFAULT = 'n'


EXIT STATUS

     0        Fail
     1        Successful completion

QQ_HELP_QQ

if ($opt_d) {
    $opt_v = 'v' ;
    $opt_i = "D:\\datasets\\ncbi\\UniGene\\Ta.seq.uniq" ;
    $opt_o = "D:\\datasets\\ncbi\\UniGene\\gbnumTest.id" ;
}


# Input path parameter
if ($opt_i) {
    $inputFile = $opt_i ;
} else {
    die "an input file with its path is required" ;
}


# output path and filename
if ($opt_o) {
    $outputFile = $opt_o ;
} else {
    die "an ouput file with its path is required" ;
}


# id to fetch, GI, Accession, or both
if ($opt_g) {
    if (($opt_g eq 'gi') or ($opt_g eq 'GI')) {
        $id = 'gi' ;
    } elsif (($opt_g eq 'gb') or ($opt_g eq 'GB')) {
        $id = 'gi' ;
    } elsif (($opt_g eq 'both') or ($opt_g eq 'BOTH')) {
        $id = 'both' ;
    }
} else {
    $opt_g = 'gb' ;
    $id    = $opt_g ;
}


# type of file
if ($opt_p) {
    if (($opt_p eq 'n') or ($opt_p eq 'N')) {
        $sequenceType = 'n';
    } elsif (($opt_p eq 'p') or ($opt_p eq 'P')) {
        $sequenceType = 'p' ;
    }
} else {
    $opt_p = 'p' ;
    $sequenceType = $opt_p ;
}

# Debug mode
# print out variables mapped from input flags

if ($opt_d) {
    print "\$opt_d: $opt_d  DEBUG mode on" ;      # undocumented
    print "\$opt_h: $opt_h\n" ;
    print "\$opt_v: $opt_v\n" ;
    print "\$opt_i: $opt_i ==> \$inputFile:    $inputFile\n" ;
    print "\$opt_o: $opt_o ==> \$outputFile:   $outputFile\n" ;
    print "\$opt_g: $opt_g ==> \$id:           $id\n" ;
    print "\$opt_p: $opt_p\==> \$sequenceType: $sequenceType\n" ;
}

####################  END OF INPUT FLAGS ########################

#################################################################
####################  Main Program ##############################

#open source file
if (open (INPUT, "<$inputFile")) {
    $opt_v and print "opened file $inputFile\n" ;
} else {
    die "Can not open $inputFile: $!.\n" ;
}

#open destination file
if (open (OUTPUT, ">>$outputFile")) {
    $opt_v and print "opened file to write:  $outputFile.\n" ;
} else {
   die "Can not write to $outputFile: $!.\n" ;
}

$opt_v and print "finding $id numbers\n" ;
while (my $line = <INPUT> ) {
    chomp ($line) ;
    if ($key = &getIDnumber($line,$id)) {
        print  OUTPUT "$key\n" ;
    } ;
}

$opt_v and print "done\n" ;
exit;




####################  END of Main Program  #######################
##################################################################


sub getIDnumber {
    my($line, $id) = @_ ;
    my $IDtoReturn = '' ;
    my $IDfromDef = '' ;
    if ($line =~ /^>/) {                # fasta Definition Line
        if (($id eq 'gi') and ($line =~ /\/gi=.*?\s/)) {
            $IDtoReturn= $1 ;
            chomp($IDtoReturn) ;
        } elsif (($id eq 'gb') and ($line =~ /\/gb=.*?\s/)) {
            $IDfromDef = $&;
            $IDtoReturn = substr($IDfromDef, 4) ;
            chomp($IDtoReturn) ;
        }

    } elsif ($line =~ /^$/){        # Blank line
        # do nothing
    } else {                            #sequence
        # do nothing
    }
    return $IDtoReturn
}
[download]

Suggestions on how we could do this without opening and closing the filehandle would be most welcome.

Thanks in advance for your replies.

MadraghRua
yet another biologist hacking perl....

In reply to Buffering File Output by MadraghRua

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.