Breaks on Mac but not Windows or Linux

vivomancer has asked for the wisdom of the Perl Monks concerning the following question:

I added "huge IO" to the title because I was unable to reproduce the failure with a 20,000-line test file. My program has worked on multiple Windows machines, all of which have Perl v5.12.3 or higher. It has failed on 5 Macs with Perl versions ranging from 5.8.x to 5.10.0. With a much smaller sample file, it has worked on a Perl v5.10.0 Mac.

My program reads in two primary files. They are altered FASTA files (UniProtKB/Swiss-Prot and UniProtKB/TrEMBL) from http://www.uniprot.org/downloads and are ~250MB and ~10GB respectively. I altered them to have each sequence on one line in the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of the altered data file:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

The taxonomy code already appears at the end of the first part of the annotation info, directly after an underscore. I repeated it as a separate field at the end to make it easier to pull out when my program was reading through a million lines.
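The record layout described above can be sketched as follows. This is only an illustration, not the original program: `parse_record` is a hypothetical helper, the sequence is shortened, and the regex for pulling the code out of the entry name assumes the `>db|accession|NAME_CODE` shape seen in the example line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original program): split one line of the
# altered FASTA file into its three tab-separated fields.
sub parse_record {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;    # drop the line ending, whether LF or CRLF
    my ($annotation, $sequence, $tax_code) = split /\t/, $line, 3;
    return ($annotation, $sequence, $tax_code);
}

# Shortened example record; the real sequence is much longer.
my $line = ">sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16\tMSTEKTKILEVKNLKAQ\tCYAPA\n";
my ($annotation, $sequence, $tax_code) = parse_record($line);

# The same code is embedded in the entry name, after the underscore.
my ($from_annotation) = $annotation =~ /^>[^|]*\|[^|]*\|[^_\s]+_(\S+)/;

print "$tax_code\n";           # CYAPA
print "$from_annotation\n";    # CYAPA
```

Comparing the third field against the code embedded in the annotation is a cheap way to notice a record whose last field picked up stray characters.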

My problem started when I added a taxonomy filter. A user specifies a species name, such as "Cyanophora", and my program parses the taxonomy data file, which I downloaded in tab-delimited format. Here is an example:

Taxon    Mnemonic    Scientific name    Common name    Synonym    Other Names    Reviewed    Rank    Lineage    Parent
43989    CYAA5    Cyanothece sp. (strain ATCC 51142)            Cyanothece (strain ATCC 51142); Cyanothece 51142; Cyanothece ATCC51142; Cyanothece sp. ATCC 51142; Cyanothece sp. BH68; Cyanothece sp. BH68K    reviewed    Species    Bacteria; Cyanobacteria; Chroococcales; Cyanothece    43988

If the name the user supplied matches anything on a line of the taxonomy file, I take the second field of that line (the mnemonic) and add it to an array. When reading the protein file, if the taxonomy code of the current line exists in that array, I send the line on to the rest of the program.

Here is the relevant code:

my $slash = "/";
if("$^O" eq "MSWin32"){
    $slash = "\\";
}
# restrict the search to a specific taxonomy
my $taxon = $ARGV[3];
$annotation .= "\t$taxon";
open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
my @taxR = <tax_file>;
close tax_file;
if($taxon){
    @taxR = grep { /$taxon/i } @taxR;
    for(my $e = 0; $e < scalar(@taxR); $e++){
        my @taxRR = split(/\t/, $taxR[$e]);
        $taxR[$e] = $taxRR[1];
    }
}
my %taxR = map { $_ => 1 } @taxR;
print "cyaa5 = ".$taxR{"CYAA5"};#prints cyaa5 = 1
#skipping a bunch of unrelated stuff
    open(ps_file, "..".$slash."dataset".$slash.$tempFile) or die "couldn't open $tempFile";
    while(<ps_file>){
        chomp;
        
        
        my @curLine = split(/\t/, $_);
        my $filter = 1;
        if($taxon){
            print "$curLine[2]\t$taxR{$curLine[2]}\n";#produced weird output when run with the huge protein file, posted below
            #these commented out lines are previous attempts that work on windows but not Mac
            #$filter = $curLine[2] ~~ @taxR;
            #$filter = scalar(grep( /^$curLine[2]$/, @taxR ));
            #$filter = ( first { $_ eq $curLine[2] } @taxR );
            #print $taxR{curLine[2]}."\n";
            $filter = $taxR{$curLine[2]};
        }
        if($filter){
            checkSeq(@curLine);
        }
    }
    close ps_file;

sample output

from print in while loop on mac with massive $tempFile, about one million lines of it

FRG3G
    
FRG3G
    
IIV3
    
IIV3
    
FRG3G
    
FRG3G
    
IIV3
    
FRG3G
    
IIV6
    
FRG3G
    
FRG3G

sample output from print in loop on windows with massive $tempFile

GLOVI    1
GRATL    
PORPU    
PORYE    
PROM0    1
PROM2    1
PROM3    1
PROM4    1
PROM5    1
PROM9    1

sample output from print in loop on the mac with a small $tempFile

ECOSM    
ECOUT    
ENT38    
ERWCT    
ESCF3    
PECCP    
CYAP4    1
DEIRA    
DELAS    
DESAP    
DESHY

Re: Breaks on Mac but not Windows - huge IO by frozenwithjoy (Priest) on Jun 27, 2012 at 02:49 UTC
How much RAM do the different machines have? Is the file that causes problems when it gets large the taxonomy file read here? `my @taxR = <tax_file>;` If so, your problem may be that you are reading the entire file into memory at once (each line becomes an element of your array) and your machine is running out of memory. When you say the script fails, what exactly do you mean? Is there an error? Does the process get killed by the OOM killer?

EDIT: I re-read your question and realized that `$tempFile` is the file that causes problems as it gets too large. Is that correct? What does that file look like? Also, is a non-zero value or string always assigned by `$filter = $taxR{$curLine[2]};`? If so, I'm not sure I understand the if-conditional around `checkSeq(@curLine);`. What is `checkSeq` doing? What happens if you run it on the Windows machine, but include `use 5.010;`? Just out of curiosity, you have `use strict; use warnings;` and there are no errors, right?
Re^2: Breaks on Mac but not Windows - huge IO by vivomancer (Initiate) on Jun 27, 2012 at 03:59 UTC
I'm about to go to sleep, so I won't be able to run some of your tests until the morning.

taxonomy.tab is 800 KB, the smaller $tempFile is 250,000 KB, and the larger is 10 GB. When I use a 23,000 KB $tempFile, the program completes successfully on both a Mac and a PC. The problem does occur when $tempFile gets too large.

$tempFile has the format "$annotativeInformation\t$aminoacidSequence\t$taxonomyCode\n". The following is an example of one line of $tempFile:

>sp|P48255|ABCX_CYAPA Probable ATP-dependent transporter ycf16 OS=Cyanophora paradoxa GN=ycf16 PE=3 SV=1 MSTEKTKILEVKNLKAQVDGTEILKGVNLTINSGEIHAIMGPNGSGKSTFSKILAGHPAYQVTGGEILFKNKNLLELEPEERARAGVFLAFQYPIEIAGVSNIDFLRLAYNNRRKEEGLTELDPLTFYSIVKEKLNVVKMDPHFLNRNVNEGFSGGEKKRNEILQMALLNPSLAILDETDSGLDIDALRIVAEGVNQLSNKENSIILITHYQRLLDYIVPDYIHVMQNGRILKTGGAELAKELEIKGYDWLNELEMVKK CYAPA

What I mean by the script failing is that the hash %taxR is messed up, as shown in the 3 examples of output (2 good, 1 bad), which causes the program to not forward any lines of $tempFile to the rest of the program. There is no error. For the program to progress, some of the hash values must equal 1.

$filter is true if there is a value for $taxR{$curLine[2]}, which is built near the top.

checkSeq(@curLine) is the rest of my program, which works no matter where I test it if I set $filter to 1. The program doesn't work because $filter is never set to 1, since the hash seems to break when $tempFile is too large.

I use strict but I haven't used warnings; I'll have to check that.
Re^2: Breaks on Mac but not Windows - huge IO by vivomancer (Initiate) on Jun 27, 2012 at 16:49 UTC
I changed the way it reads in tax_file to this:

my $taxon = $ARGV[3];
unless($taxon){
    $taxon = "";#default is blank
}
$annotation .= "\t$taxon";
my @taxList = split(/\|/, $taxon);
open(tax_file, "..".$slash."dataset".$slash."taxonomy.tab") or die "couldn't open taxonomy.tab";
#my @taxR = <tax_file>;
my %taxR;
if($taxon){
    while(<tax_file>){
        foreach my $tempTax (@taxList){
            if($_ =~ m/$tempTax/i){
                my @tempTax = split(/\t/, $_);
                $taxR{$tempTax[1]} = 1;
            }
        }
    }
}
close tax_file;

I get the same results as last time. I also had the opportunity to test it on a Unix machine, and the program works fine there. As far as use warnings goes, I need to do a lot of editing because my program relies heavily on uninitialized values counting as false, so I'm going to work on that now.
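On the point about uninitialized values counting as false: one possible way to keep that behaviour under `use warnings` is to test for the key with `exists` instead of reading a possibly missing value. A minimal sketch, assuming the filter only needs to know whether a code is present in the hash; the codes below are illustrative values taken from the sample output, not a real lookup table.

```perl
use strict;
use warnings;

# Illustrative lookup table; in the real program this is built from taxonomy.tab.
my %taxR = (CYAA5 => 1);

my @results;
for my $code (qw(CYAA5 FRG3G)) {
    # exists never yields undef, so printing the result cannot trigger
    # an "uninitialized value" warning.
    my $filter = exists $taxR{$code} ? 1 : 0;
    push @results, "$code\t$filter";
}

print "$_\n" for @results;    # CYAA5<TAB>1, then FRG3G<TAB>0
```

The missing code now prints an explicit 0 rather than an empty string, which also makes the debug output easier to read.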
Re^3: Breaks on Mac but not Windows - huge IO by Anonymous Monk on Jun 28, 2012 at 07:22 UTC
BTW, in Perl, you can just use the forward slash in paths, and it'll work just fine on Windows. No need for that silly `$slash` variable.
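A minimal sketch of what that could look like, assuming the same `../dataset/taxonomy.tab` layout as in the original post. `File::Spec` is a core module and is shown only as an alternative for building the path. The three-argument `open` with a lexical filehandle is a stylistic choice here, not something the thread asked for.

```perl
use strict;
use warnings;
use File::Spec;

# Forward slashes are accepted by Perl's open on Windows as well as Unix and Mac.
my $plain = "../dataset/taxonomy.tab";

# Alternative: let File::Spec build the path for the current platform.
my $portable = File::Spec->catfile(File::Spec->updir, "dataset", "taxonomy.tab");

print "$plain\n$portable\n";

# Three-argument open with a lexical filehandle; warn instead of die so the
# sketch still runs where the file does not exist.
if (open my $tax_fh, '<', $plain) {
    my $header = <$tax_fh>;
    close $tax_fh;
} else {
    warn "couldn't open $plain: $!";
}
```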
Re: Breaks on Mac but not Windows or Linux - huge IO by vivomancer (Initiate) on Jun 28, 2012 at 16:53 UTC
I received a fix to my problem here: http://stackoverflow.com/questions/11245797/perl-large-io-bug-on-mac-but-not-windows-or-linux-adds-newline-cant-be-chomped/11246092#11246092 The problem was possibly that formatting the $tempFile on Windows caused a hidden character to be added to the newlines, which chomp did not remove on a Mac.
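A minimal sketch of how such a hidden character could break the hash lookup, assuming it is a carriage return left by Windows-style CRLF line endings (the thread does not confirm exactly which character it was). `strip_eol` is a hypothetical helper and the record is shortened.

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing CR/LF combination, not just "\n".
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

my %taxR = (CYAPA => 1);
my $windows_line = "annotation\tMSTEK\tCYAPA\r\n";    # CRLF-terminated record

# chomp removes only the value of $/ ("\n" by default), so the "\r"
# stays attached to the last field and the key no longer matches.
my $chomped = $windows_line;
chomp $chomped;
my $bad_key = (split /\t/, $chomped)[2];
print exists $taxR{$bad_key} ? "found\n" : "missing\n";     # missing

# Stripping both characters restores the expected key.
my $good_key = (split /\t/, strip_eol($windows_line))[2];
print exists $taxR{$good_key} ? "found\n" : "missing\n";    # found
```

Printing a suspect field with its length, for example `length($curLine[2])`, is a quick way to spot an invisible trailing character like this.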