I put "huge IO" in the title because I was unable to reproduce the failure with a 20,000-line test file. My program works on multiple Windows machines, but all of them have Perl v5.12.3 or higher. It has failed on 5 Macs with Perl versions ranging from 5.8.x to 5.10.0. With a much smaller sample file, it works on a Mac running Perl v5.10.0.

My program reads in two primary files. They are altered FASTA files (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) from http://www.uniprot.org/downloads and are ~250MB and ~10GB respectively. I altered them so that each sequence is on one line, in the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of the altered data file:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1    MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK    CYAPA

The taxonomy code is at the end of the first part of the annotation info, directly following an underscore. I also added it as an extra field at the end of the line to make it easier to pull out when my program is reading through a million lines.
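To make the line format concrete, here is a minimal sketch of splitting one such line and of recovering the taxonomy code from the annotation itself. The helper names `parse_record` and `code_from_annotation` are hypothetical and are not part of my program, and the sequence in the sample line is shortened for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: split one altered-FASTA line into its three
# tab-separated fields (annotation, sequence, taxonomy code).
sub parse_record {
    my ($line) = @_;
    chomp $line;
    my ($annotation, $sequence, $taxonomy) = split /\t/, $line, 3;
    return ($annotation, $sequence, $taxonomy);
}

# Hypothetical helper: recover the code from the annotation, i.e. the text
# after the underscore in the third '|'-separated part of the header.
sub code_from_annotation {
    my ($annotation) = @_;
    return $annotation =~ /^>[^|]*\|[^|]*\|[^_\s]+_(\S+)/ ? $1 : undef;
}

# Sample line from above, with the sequence shortened.
my $line = ">sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 "
         . "OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1\tMSTEKTKILEVKNLKAQ\tCYAPA\n";

my ($annotation, $sequence, $taxonomy) = parse_record($line);
print "field: $taxonomy, from annotation: ",
      code_from_annotation($annotation), "\n";
```

Both routes should give the same code for a well-formed line; reading the third field simply avoids running a regex on every line of a very large file.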

My problem started when I added a taxonomy filter. The user specifies a species name such as "Cyanophora", and my program parses the taxonomy data file, which I downloaded in tab-delimited format. Here is an example:

Taxon    Mnemonic    Scientific name    Common name    Synonym    Other Names    Reviewed    Rank    Lineage    Parent
43989    CYAA5    Cyanothece sp. (strain ATCC 51142)    Cyanothece (strain ATCC 51142); Cyanothece 51142; Cyanothece ATCC51142; Cyanothece sp. ATCC 51142; Cyanothece sp. BH68; Cyanothece sp. BH68K    reviewed    Species    Bacteria; Cyanobacteria; Chroococcales; Cyanothece    43988

If whatever the user entered (the quoted name above) matches anything on a line of the taxonomy file, I take the second item on that line and add it to an array. If the taxonomy code of the current line of the protein file exists in that array, I send the line on to the rest of the program.
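As a self-contained illustration of the matching rule just described (this is not my program's actual code), here is a minimal sketch that runs on in-memory data. The helper name `matching_codes` is hypothetical, the taxonomy and protein lines are shortened stand-ins rather than copies from the real files, and the sketch quotes the search term with `\Q...\E` so it is matched literally, which is an assumption that differs from a plain `/$taxon/i` match.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the mnemonic (second tab-separated field)
# of every taxonomy line that matches the user's search term.
sub matching_codes {
    my ($term, @tax_lines) = @_;
    my %codes;
    for my $line (@tax_lines) {
        next unless $line =~ /\Q$term\E/i;   # \Q..\E: treat the term literally
        my @fields = split /\t/, $line;
        $codes{ $fields[1] } = 1 if defined $fields[1];
    }
    return %codes;
}

# Shortened, illustrative stand-ins for taxonomy.tab lines
# (taxon id, mnemonic, scientific name); ids are not checked against the file.
my @tax_lines = (
    "43989\tCYAA5\tCyanothece sp. (strain ATCC 51142)\n",
    "2762\tCYAPA\tCyanophora paradoxa\n",
);
my %wanted = matching_codes('Cyanophora', @tax_lines);

# Shortened protein lines: annotation, sequence, taxonomy code.
for my $record ("annotation A\tMSTEK\tCYAPA\n", "annotation B\tMKKLL\tCYAA5\n") {
    chomp(my $copy = $record);
    my @cur_line = split /\t/, $copy;
    print "keep $cur_line[2]\n" if $wanted{ $cur_line[2] };
}
```

A hash keeps each per-line check to a single lookup, which matters when the protein file has millions of lines.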

Here is the relevant code:

    my $slash = "/";
    if("$^O" eq "MSWin32"){
        $slash = "\\";
    }

    # restrict the search to a specific taxonomy
    my $taxon = $ARGV[3];
    $annotation .= "\t$taxon";
    open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
    my @taxR = <tax_file>;
    close tax_file;
    if($taxon){
        @taxR = grep { /$taxon/i } @taxR;
        for(my $e = 0; $e < scalar(@taxR); $e++){
            my @taxRR = split(/\t/, $taxR[$e]);
            $taxR[$e] = $taxRR[1];
        }
    }
    my %taxR = map { $_ => 1 } @taxR;
    print "cyaa5 = ".$taxR{"CYAA5"};#prints cyaa5 = 1

    #skipping a bunch of unrelated stuff

    open(ps_file, "..".$slash."dataset".$slash.$tempFile) or die "couldn't open $tempFile";
    while(<ps_file>){
        chomp;
        my @curLine = split(/\t/, $_);
        my $filter = 1;
        if($taxon){
            print "$curLine[2]\t$taxR{$curLine[2]}\n";#produced weird output when run with the huge protein file, will post below
            #these commented out lines are previous attempts that work on windows but not Mac
            #$filter = $curLine[2] ~~ @taxR;
            #$filter = scalar(grep( /^$curLine[2]$/, @taxR ));
            #$filter = ( first { $_ eq $curLine[2] } @taxR );
            #print $taxR{curLine[2]}."\n";
            $filter = $taxR{$curLine[2]};
        }
        if($filter){
            checkSeq(@curLine);
        }
    }
    close ps_file;

Sample output from the print in the while loop on the Mac with the massive $tempFile, about one million lines of it:

    FRG3G
    FRG3G
    IIV3
    IIV3
    FRG3G
    FRG3G
    IIV3
    FRG3G
    IIV6
    FRG3G
    FRG3G

Sample output from the print in the loop on Windows with the massive $tempFile:

    GLOVI    1
    GRATL
    PORPU
    PORYE
    PROM0    1
    PROM2    1
    PROM3    1
    PROM4    1
    PROM5    1
    PROM9    1

Sample output from the print in the loop on the Mac with a small $tempFile:

    ECOSM
    ECOUT
    ENT38
    ERWCT
    ESCF3
    PECCP
    CYAP4    1
    DEIRA
    DELAS
    DESAP
    DESHY

In reply to Breaks on Mac but not Windows or Linux - huge IO by vivomancer
