vivomancer has asked for the wisdom of the Perl Monks concerning the following question:
I added "huge IO" to the title because I was unable to reproduce the failure with a 20,000-line test file. My program has worked on multiple Windows machines, but all of them have Perl v5.12.3 or higher. It has failed on 5 Macs with Perl versions ranging from 5.8.x to 5.10.0. When I use a much smaller sample file, it works on a Mac with Perl v5.10.0.
There are two primary files my program reads in. They are altered FASTA files (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) from http://www.uniprot.org/downloads. These files are ~250MB and ~10GB respectively. I altered them to have each sequence on one line, in the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of the altered data file:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

The taxonomy code is at the end of the first part of the annotation info, directly following an underscore. I also appended it as a separate field at the end of the line to make it easier to pull out when my program is reading through a million lines.
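The record layout described above can be sketched as a small parser. This is only an illustration, not code from the program in question: `parse_record` is a hypothetical helper, the sequence in the sample call is shortened, and the stripping of a trailing carriage return is an assumption about what a file moved between platforms might contain.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): split one line of the
# altered FASTA file into annotation, sequence, and taxonomy code.
sub parse_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # assumption: tolerate both LF and CRLF endings
    my ($annotation, $sequence, $taxon) = split /\t/, $line, 3;

    # The code also follows the last underscore of the first token,
    # e.g. ">sp|P48255|ABCX_CYAPA" carries "CYAPA".
    my ($embedded) = $annotation =~ /^>\S*_(\w+)/;

    return {
        annotation => $annotation,
        sequence   => $sequence,
        taxon      => $taxon,
        embedded   => $embedded,
    };
}

# Shortened sample record; the real sequence is much longer.
my $rec = parse_record(
    ">sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16\tMSTEKTKILEV\tCYAPA\r\n"
);
print "field: $rec->{taxon}, embedded: $rec->{embedded}\n";
```

Comparing the appended field with the code embedded in the annotation gives a cheap consistency check while reading a large file.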
My problem started when I added a taxonomy filter. A user specifies a species name, e.g. "Cyanophora", and my program parses the taxonomy data file, which I downloaded in tab-delimited format. Here is an example:
Taxon Mnemonic Scientific name Common name Synonym Other Names Reviewed Rank Lineage Parent
43989 CYAA5 Cyanothece sp. (strain ATCC 51142) Cyanothece (strain ATCC 51142); Cyanothece 51142; Cyanothece ATCC51142; Cyanothece sp. ATCC 51142; Cyanothece sp. BH68; Cyanothece sp. BH68K reviewed Species Bacteria; Cyanobacteria; Chroococcales; Cyanothece 43988
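As a rough sketch of how the Mnemonic column might be collected from such a file, the following reads row by row instead of loading the whole file into an array. `mnemonics_for` is a hypothetical name, the sample row is abbreviated to its first three fields, and the search term is quoted with `\Q...\E` so it is matched literally, which differs from a bare `/$taxon/i`.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash ref whose keys are the Mnemonic values
# (field index 1) of every row matching $taxon, case-insensitively.
sub mnemonics_for {
    my ($fh, $taxon) = @_;
    my %wanted;
    while (my $row = <$fh>) {
        $row =~ s/\r?\n\z//;                 # assumption: accept LF or CRLF
        next unless $row =~ /\Q$taxon\E/i;   # literal match, unlike /$taxon/i
        my @fields = split /\t/, $row;
        $wanted{ $fields[1] } = 1
            if defined $fields[1] && length $fields[1];
    }
    return \%wanted;
}

# In-memory stand-in for taxonomy.tab: the header plus one abbreviated row.
my $sample = join("\t", 'Taxon', 'Mnemonic', 'Scientific name') . "\n"
           . join("\t", '43989', 'CYAA5', 'Cyanothece sp. (strain ATCC 51142)') . "\n";

open my $fh, '<', \$sample or die "couldn't open in-memory sample: $!";
my $wanted = mnemonics_for($fh, 'cyanothece');
close $fh;

print join(',', sort keys %$wanted), "\n";
```

Building the hash directly avoids holding every line of the taxonomy file in memory before filtering.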
my $slash = "/";
if("$^O" eq "MSWin32"){
    $slash = "\\";
}

# restrict the search to a specific taxonomy
my $taxon = $ARGV[3];
$annotation .= "\t$taxon";
open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
my @taxR = <tax_file>;
close tax_file;
if($taxon){
    @taxR = grep { /$taxon/i } @taxR;
    for(my $e = 0; $e < scalar(@taxR); $e++){
        my @taxRR = split(/\t/, $taxR[$e]);
        $taxR[$e] = $taxRR[1];
    }
}
my %taxR = map { $_ => 1 } @taxR;
print "cyaa5 = ".$taxR{"CYAA5"};#prints cyaa5 = 1

#skipping a bunch of unrelated stuff

open(ps_file, "..".$slash."dataset".$slash.$tempFile) or die "couldn't open $tempFile";
while(<ps_file>){
    chomp;
    my @curLine = split(/\t/, $_);
    my $filter = 1;
    if($taxon){
        print "$curLine[2]\t$taxR{$curLine[2]}\n";#produced weird output with when run with the huge protein file will post below
        #these commented out lines are previous attempts that work on windows but not Mac
        #$filter = $curLine[2] ~~ @taxR;
        #$filter = scalar(grep( /^$curLine[2]$/, @taxR ));
        #$filter = ( first { $_ eq $curLine[2] } @taxR );
        #print $taxR{curLine[2]}."\n";
        $filter = $taxR{$curLine[2]};
    }
    if($filter){
        checkSeq(@curLine);
    }
}
close ps_file;
sample output from print in loop on windows with massive $tempFile
FRG3G
FRG3G
IIV3
IIV3
FRG3G
FRG3G
IIV3
FRG3G
IIV6
FRG3G
FRG3G

sample output from print in loop on the mac with a small $tempFile
GLOVI 1
GRATL
PORPU
PORYE
PROM0 1
PROM2 1
PROM3 1
PROM4 1
PROM5 1
PROM9 1

ECOSM
ECOUT
ENT38
ERWCT
ESCF3
PECCP
CYAP4 1
DEIRA
DELAS
DESAP
DESHY
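When a hash lookup succeeds for some input files and fails for others, one general way to investigate is to print each key with non-printable characters made visible. The sketch below is not a diagnosis of this program; it only illustrates the technique under the assumption that a key might carry a stray character such as a carriage return, which the post does not establish. `visible` is a hypothetical helper and the two input lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: show any byte outside printable ASCII as \xNN.
sub visible {
    my ($s) = @_;
    $s =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $s;
}

my %taxR = (CYAPA => 1);

# Two made-up records: one ending in LF, one ending in CRLF.
for my $line ("a\tb\tCYAPA\n", "a\tb\tCYAPA\r\n") {
    my $copy = $line;
    chomp $copy;                        # removes only the trailing "\n"
    my @curLine = split /\t/, $copy;
    my $status  = (exists $taxR{ $curLine[2] }) ? 'hit' : 'miss';
    printf "[%s] => %s\n", visible($curLine[2]), $status;
}
# prints:
# [CYAPA] => hit
# [CYAPA\x0D] => miss
```

If keys printed this way turn out to contain extra bytes, stripping them before the lookup (for example with `s/\r?\n\z//` in place of `chomp`) is one possible remedy.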
Re: Breaks on Mac but not Windows - huge IO
by frozenwithjoy (Priest) on Jun 27, 2012 at 02:49 UTC
by vivomancer (Initiate) on Jun 27, 2012 at 03:59 UTC
by vivomancer (Initiate) on Jun 27, 2012 at 16:49 UTC
by Anonymous Monk on Jun 28, 2012 at 07:22 UTC

Re: Breaks on Mac but not Windows or Linux - huge IO
by vivomancer (Initiate) on Jun 28, 2012 at 16:53 UTC