
I put "huge IO" in the title because I was unable to reproduce the failure with a 20,000-line test file. My program has worked on multiple Windows machines, but all of them have Perl v5.12.3 or higher. It has failed on 5 Macs with Perl versions ranging from 5.8.x to 5.10.0. With a much smaller sample file, it worked on a Mac running Perl v5.10.0.

There are two primary files my program reads in. They are altered FASTA files (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) from http://www.uniprot.org/downloads. These files are ~250MB and ~10GB respectively. I altered them to have each sequence on one line in the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of the altered data file:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

The taxonomy code is the part of the entry name directly after the underscore (CYAPA in ABCX_CYAPA). I also appended it as a third field to make it easier to pull out when my program is reading through a million lines.
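To make the line format concrete, here is a minimal sketch of how one such line could be split into its three fields, and how the taxonomy code could be recovered from the entry name instead of the third field. The subroutine names `parse_line` and `code_from_annotation` are made up for this illustration and are not part of my program, and the sequence in the sample line is shortened.

```perl
use strict;
use warnings;

# Split one altered-FASTA line into annotation, sequence and taxonomy code.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($annotation, $sequence, $taxonomy_code) = split /\t/, $line, -1;
    return ($annotation, $sequence, $taxonomy_code);
}

# Recover the taxonomy code from the entry name, i.e. the text that
# directly follows the underscore in e.g. "ABCX_CYAPA".
sub code_from_annotation {
    my ($annotation) = @_;
    return $annotation =~ /^>[^|]*\|[^|]*\|[^_\s]+_(\S+)/ ? $1 : undef;
}

# Sample line from above, with the sequence shortened for readability.
my $line = ">sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 "
         . "OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1"
         . "\tMSTEKTKILEVKNLKAQVDG\tCYAPA\n";

my ($annotation, $sequence, $code) = parse_line($line);
print "third field:     $code\n";
print "from annotation: ", code_from_annotation($annotation), "\n";
```

Both approaches should give the same code for a well-formed line; the third field just avoids running a regex on every line.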

My problem started when I added a taxonomy filter. A user specifies a species name, e.g. "Cyanophora", and my program parses the taxonomy data file, which I downloaded in tab-delimited format. Here is an example:

Taxon    Mnemonic    Scientific name    Common name    Synonym    Other Names    Reviewed    Rank    Lineage    Parent
43989    CYAA5    Cyanothece sp. (strain ATCC 51142)            Cyanothece (strain ATCC 51142); Cyanothece 51142; Cyanothece ATCC51142; Cyanothece sp. ATCC 51142; Cyanothece sp. BH68; Cyanothece sp. BH68K    reviewed    Species    Bacteria; Cyanobacteria; Chroococcales; Cyanothece    43988

If the name the user specified matches anything on a line of the taxonomy file, I take the second field of that line (the mnemonic) and add it to an array. If the taxonomy code of the current protein line exists in that array, I pass the line on to the rest of the program.

Here is the relevant code:

my $slash = "/";
if("$^O" eq "MSWin32"){
    $slash = "\\";
}
# restrict the search to a specific taxonomy
my $taxon = $ARGV[3];
$annotation .= "\t$taxon";
open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
my @taxR = <tax_file>;
close tax_file;
if($taxon){
    @taxR = grep { /$taxon/i } @taxR;
    for(my $e = 0; $e < scalar(@taxR); $e++){
        my @taxRR = split(/\t/, $taxR[$e]);
        $taxR[$e] = $taxRR[1];
    }
}
my %taxR = map { $_ => 1 } @taxR;
print "cyaa5 = ".$taxR{"CYAA5"};#prints cyaa5 = 1
#skipping a bunch of unrelated stuff
    open(ps_file, "..".$slash."dataset".$slash.$tempFile) or die "couldn't open $tempFile";
    while(<ps_file>){
        chomp;
        
        
        my @curLine = split(/\t/, $_);
        my $filter = 1;
        if($taxon){
            print "$curLine[2]\t$taxR{$curLine[2]}\n"; # produces weird output when run with the huge protein file; output posted below
            # these commented-out lines are previous attempts that work on Windows but not on the Mac
            #$filter = $curLine[2] ~~ @taxR;
            #$filter = scalar(grep( /^$curLine[2]$/, @taxR ));
            #$filter = ( first { $_ eq $curLine[2] } @taxR );
            #print $taxR{curLine[2]}."\n";
            $filter = $taxR{$curLine[2]};
        }
        if($filter){
            checkSeq(@curLine);
        }
    }
    close ps_file;

Sample output from the print in the while loop on the Mac with the massive $tempFile (about one million lines like this):

FRG3G
    
FRG3G
    
IIV3
    
IIV3
    
FRG3G
    
FRG3G
    
IIV3
    
FRG3G
    
IIV6
    
FRG3G
    
FRG3G
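In case it helps anyone reproduce this, here is a minimal sketch of how a field could be printed with invisible characters shown as escapes, so that blank-looking output can be inspected. The helper name `visible` and the example string are made up for illustration; they are not part of my program or my data.

```perl
use strict;
use warnings;
use Data::Dumper;

# Return the value as a double-quoted string in which control characters
# (tab, carriage return, newline, ...) are written as escape sequences.
sub visible {
    my ($text) = @_;
    local $Data::Dumper::Useqq  = 1;  # escape control characters
    local $Data::Dumper::Terse  = 1;  # no "$VAR1 = " prefix
    local $Data::Dumper::Indent = 0;  # keep the result on a single line
    return Dumper($text);
}

# Hypothetical field value, used only to demonstrate the helper.
my $example = "ABC\tDEF\r";
print visible($example), "\n";
```

In the loop above this could be used as `print visible($curLine[2]), "\n";` in place of the plain print.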

Sample output from the print in the loop on Windows with the massive $tempFile:

GLOVI    1
GRATL    
PORPU    
PORYE    
PROM0    1
PROM2    1
PROM3    1
PROM4    1
PROM5    1
PROM9    1

Sample output from the print in the loop on the Mac with a small $tempFile:

ECOSM    
ECOUT    
ENT38    
ERWCT    
ESCF3    
PECCP    
CYAP4    1
DEIRA    
DELAS    
DESAP    
DESHY

In reply to Breaks on Mac but not Windows or Linux - huge IO by vivomancer
