vivomancer has asked for the wisdom of the Perl Monks concerning the following question:

I added huge IO in the title because I was unable to replicate the break with a 20,000 line test file. My program has worked on multiple Windows machines, all of which have Perl v5.12.3 or higher. It has failed on 5 Macs with Perl versions ranging from 5.8.x to 5.10.0. When I make a much smaller sample file, it works on a Perl v5.10.0 Mac.

There are two primary files my program reads in. They are altered FASTA files (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) from this website: http://www.uniprot.org/downloads These files are ~250MB and ~10GB respectively. I altered them to have each sequence on 1 line with the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of the altered data file.

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1    MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK    CYAPA

The taxonomy code is at the end of the first part of the annotation info, directly following an underscore. I added it again at the end of the line to make it easier to pull out when my program is reading through a million lines.
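The line format described above can be sketched as follows. This is only an illustration: `parse_record` and `taxonomy_from_annotation` are hypothetical helper names that do not appear in my program, the sequence in the sample line is shortened, and the regular expression assumes the annotation always begins with ">db|accession|NAME_TAXCODE".

```perl
use strict;
use warnings;

# Hypothetical helper: split one altered-FASTA line into its three
# tab-separated fields (annotation, sequence, taxonomy code).
sub parse_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # strip a trailing LF or CRLF
    my ($annotation, $sequence, $taxonomy) = split /\t/, $line, 3;
    return ($annotation, $sequence, $taxonomy);
}

# Hypothetical helper: recover the taxonomy code from the annotation alone,
# i.e. the word after the underscore in the third '|'-separated part.
sub taxonomy_from_annotation {
    my ($annotation) = @_;
    return $annotation =~ /^>\w+\|[^|]+\|[^_|\s]+_(\w+)/ ? $1 : undef;
}

# Shortened sample line (the real sequence is much longer).
my $line = ">sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16\tMSTEKTKILEVK\tCYAPA\n";

my ($annotation, $sequence, $taxonomy) = parse_record($line);
print "third field:     $taxonomy\n";
print "from annotation: ", taxonomy_from_annotation($annotation), "\n";
```

Both lookups should give the same code for a well-formed line, which is why the third field is only a convenience.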

My problem started when I added a taxonomy filter. Basically, a user specifies a species name such as "Cyanophora" and my program parses through the taxonomy data file, which I downloaded in tab-delimited format. Here is an example:

Taxon Mnemonic Scientific name Common name Synonym Other Names Reviewed Rank Lineage Parent
43989 CYAA5 Cyanothece sp. (strain ATCC 51142) Cyanothece (strain ATCC 51142); Cyanothece 51142; Cyanothece ATCC51142; Cyanothece sp. ATCC 51142; Cyanothece sp. BH68; Cyanothece sp. BH68K reviewed Species Bacteria; Cyanobacteria; Chroococcales; Cyanothece 43988

Basically, if whatever the user put in quotes up there matches anything found on a line in the taxonomy file, I take the second item of that line and add it to an array. If the taxonomy code of the current line exists in the array, I then send the line on to the rest of the program.

here is the relevant code
my $slash = "/";
if("$^O" eq "MSWin32"){
    $slash = "\\";
}

# restrict the search to a specific taxonomy
my $taxon = $ARGV[3];
$annotation .= "\t$taxon";
open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
my @taxR = <tax_file>;
close tax_file;
if($taxon){
    @taxR = grep { /$taxon/i } @taxR;
    for(my $e = 0; $e < scalar(@taxR); $e++){
        my @taxRR = split(/\t/, $taxR[$e]);
        $taxR[$e] = $taxRR[1];
    }
}
my %taxR = map { $_ => 1 } @taxR;
print "cyaa5 = ".$taxR{"CYAA5"}; #prints cyaa5 = 1

#skipping a bunch of unrelated stuff

open(ps_file, "..".$slash."dataset".$slash.$tempFile) or die "couldn't open $tempFile";
while(<ps_file>){
    chomp;
    my @curLine = split(/\t/, $_);
    my $filter = 1;
    if($taxon){
        print "$curLine[2]\t$taxR{$curLine[2]}\n"; #produced weird output when run with the huge protein file, will post below
        #these commented out lines are previous attempts that work on windows but not Mac
        #$filter = $curLine[2] ~~ @taxR;
        #$filter = scalar(grep( /^$curLine[2]$/, @taxR ));
        #$filter = ( first { $_ eq $curLine[2] } @taxR );
        #print $taxR{curLine[2]}."\n";
        $filter = $taxR{$curLine[2]};
    }
    if($filter){
        checkSeq(@curLine);
    }
}
close ps_file;
sample output from print in while loop on the Mac with the massive $tempFile (about one million lines of it)
FRG3G
FRG3G
IIV3
IIV3
FRG3G
FRG3G
IIV3
FRG3G
IIV6
FRG3G
FRG3G
sample output from print in loop on windows with massive $tempFile
GLOVI 1
GRATL
PORPU
PORYE
PROM0 1
PROM2 1
PROM3 1
PROM4 1
PROM5 1
PROM9 1
sample output from print in loop on the mac with a small $tempFile
ECOSM
ECOUT
ENT38
ERWCT
ESCF3
PECCP
CYAP4 1
DEIRA
DELAS
DESAP
DESHY

Replies are listed 'Best First'.
Re: Breaks on Mac but not Windows - huge IO
by frozenwithjoy (Priest) on Jun 27, 2012 at 02:49 UTC

    How much RAM do the different machines have? Is the file that causes problems when it gets large the taxonomy file used here?

    my @taxR = <tax_file>;

    If so, your problem may be that you are reading the entire file into memory at once (each line becomes an element of your array), and your machine may be running out of memory. When you say the script fails, what exactly do you mean? Is there an error message? Does the process get killed by the OOM killer?

    EDIT: I re-read your question and realized that $tempFile is the file that causes problems as it gets too large. Is that correct? What does that file look like? Also, is a non-zero value or string always assigned during $filter = $taxR{$curLine[2]};? If so, I'm not sure I understand the if-conditional for checkSeq(@curLine);. What is checkSeq doing?

    What happens if you run it on the Windows machine, but include use 5.010;? Just out of curiosity, you have use strict; use warnings; and there are no errors, right?
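One way to answer "what does that file look like?" without eyeballing a 10GB file is to print each lookup key with its non-printable bytes escaped. The sketch below is an assumption-laden illustration, not the original program: `visible` is a hypothetical helper, and the sample line is made up with a stray control character at the end to show what such a check would reveal.

```perl
use strict;
use warnings;

# Hypothetical helper: render every non-printable byte as a \xNN escape
# so that invisible characters show up in the output.
sub visible {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $text;
}

my %taxR = (CYAPA => 1);

# Made-up record whose line ending carries an extra control character.
my $line = "annotation\tSEQUENCE\tCYAPA\r\n";
chomp $line;    # removes only the trailing "\n" while $/ is "\n"

my @curLine = split /\t/, $line;
my $status  = exists $taxR{ $curLine[2] } ? "found" : "missing";
print visible($curLine[2]), " => $status\n";
```

If the escaped key shows anything beyond the five-letter code, the hash lookup cannot match even though the two strings look identical on screen.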

      I'm about to go to sleep so I won't be able to run some of your tests until the morning

      taxonomy.tab is 800 KB, the smaller $tempFile is 250,000 KB, the larger is 10 GB. When I use a 23,000 KB $tempFile, the program completes successfully on both a Mac and a PC.

      The problem does occur when $tempFile gets too large. $tempFile has the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of $tempFile.

      >sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

      What I mean by the script failing is that the hash %taxR is messed up, as shown in the 3 examples of output (2 good, 1 bad), which causes the program to not forward any lines of $tempFile to the rest of the program. There is no error. For the program to progress, some of the hash values must equal 1.

      $filter is true if there is a value for $taxR{$curLine[2]}, which is built near the top.

      checkSeq(@curLine) is the rest of my program, which works no matter where I test it if I set $filter to be equal to 1. The program doesn't work because $filter is never set to 1, because the hash seems to break when $tempFile is too large.

      I use strict but I haven't used warnings; I'll have to check that.

      I changed the way it reads in tax_file to this:

      my $taxon = $ARGV[3];
      unless($taxon){
          $taxon = ""; #default is blank
      }
      $annotation .= "\t$taxon";
      my @taxList = split(/\|/, $taxon);
      open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
      #my @taxR = <tax_file>;
      my %taxR;
      if($taxon){
          while(<tax_file>){
              foreach my $tempTax (@taxList){
                  if($_ =~ m/$tempTax/i){
                      my @tempTax = split(/\t/, $_);
                      $taxR{$tempTax[1]} = 1;
                  }
              }
          }
      }
      close tax_file;

      I get the same results as last time. I also had the opportunity to test it on a unix machine and the program works fine on that machine.

      As far as use warnings goes, I need to do a lot of editing or parsing because my program relies heavily on uninitialized values counting as false, so I'm going to work on that now.
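For the filter itself, one possible way to keep the "missing means false" behaviour without tripping "uninitialized value" warnings is to test the hash with exists instead of reading the value. This is a minimal sketch under that assumption; `passes_filter` is a hypothetical name, and the hash contents are made up.

```perl
use strict;
use warnings;

my %taxR = (CYAA5 => 1, CYAPA => 1);

# Hypothetical helper: return 1 or 0, never undef, so the result can be
# printed or tested under "use warnings" without complaints.
sub passes_filter {
    my ($code, $lookup) = @_;
    return 0 unless defined $code;    # line had fewer than three fields
    return exists $lookup->{$code} ? 1 : 0;
}

for my $code ('CYAPA', 'ECOLI', undef) {
    my $label = defined $code ? $code : '(undef)';
    print "$label\t", passes_filter($code, \%taxR), "\n";
}
```

This avoids the defined-or operator (//), which is not available on the Perl 5.8.x Macs mentioned in the question.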

        BTW, in Perl, you can just use the forward-slash in paths, and it'll work just fine in Windows. No need for that silly $slash variable.
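A short sketch of what that looks like, with the core File::Spec module shown as an alternative for building the path from parts. `open_dataset` is a hypothetical helper, and the three-argument open with a lexical filehandle is a suggestion of mine rather than something from the original program.

```perl
use strict;
use warnings;
use File::Spec;

# A plain forward-slash path; no per-OS separator variable needed.
my $plain = "../dataset/taxonomy.tab";

# File::Spec builds the same relative path using the local OS conventions.
my $portable = File::Spec->catfile(File::Spec->updir, 'dataset', 'taxonomy.tab');

# Hypothetical helper: open a file from the dataset directory and report
# the system error ($!) if that fails. Not called here, since the file
# only exists on the original poster's machine.
sub open_dataset {
    my ($name) = @_;
    my $path = "../dataset/$name";
    open(my $fh, '<', $path) or die "couldn't open $path: $!";
    return $fh;
}

print "$plain\n$portable\n";
```

Including $! in the die message also tells you why the open failed, which the original "couldn't open" messages do not.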

Re: Breaks on Mac but not Windows or Linux - huge IO
by vivomancer (Initiate) on Jun 28, 2012 at 16:53 UTC
    I received a fix to my problem here: http://stackoverflow.com/questions/11245797/perl-large-io-bug-on-mac-but-not-windows-or-linux-adds-newline-cant-be-chomped/11246092#11246092

    The problem was possibly that formatting the $tempFile on Windows caused a hidden character to be added to the newlines, and that character was not removed by chomp on a Mac.
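A minimal sketch of that failure and one possible workaround, assuming the hidden character is a carriage return ("\r") left over from Windows CRLF line endings. The thread does not state this outright, so treat it as an assumption; `strip_eol` is a hypothetical helper and the sample record is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a trailing LF or CRLF, regardless of which
# platform wrote the file.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

my %taxR = (CYAPA => 1);

# Made-up record with a CRLF line ending (assumed cause of the problem).
my $line = "annotation\tSEQUENCE\tCYAPA\r\n";

my $chomped = $line;
chomp $chomped;    # removes "\n" only, so the key keeps its "\r"

my $chomped_key  = (split /\t/, $chomped)[2];
my $stripped_key = (split /\t/, strip_eol($line))[2];

print exists $taxR{$chomped_key}  ? "chomp: match\n"     : "chomp: no match\n";
print exists $taxR{$stripped_key} ? "strip_eol: match\n" : "strip_eol: no match\n";
```

Under this assumption, swapping chomp for the substitution inside the while loop, or converting the file's line endings once up front, would let the existing hash lookup work unchanged.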