PerlMonks  

Error when running on larger files

by K_Edw (Beadle)
on Jun 28, 2016 at 08:36 UTC ( [id://1166729]=perlquestion )

K_Edw has asked for the wisdom of the Perl Monks concerning the following question:

I have a small snippet of code which processes a tab-delimited .txt file 2 lines at a time:

    while (<$IN2>) {
        chomp $_;
        next if eof;
        my @F2 = split( "\t", $_ );          #Split each tab-delimited field
        my $partner = <$IN2>;
        my @F3 = split( "\t", $partner );    #Split each tab-delimited field
        $store{ ( abs( $F2[2] - $F3[2] ) + 1 ) }++;
        ( $F2[2], $F3[2] ) = ( $F3[2], $F2[2] ) if $F2[2] > $F3[2];
        $Tally{ $F2[1] }{ $F2[2] }{ $F3[2] }++;
    }
    foreach my $key ( sort { $a <=> $b } keys %store ) {
        print $OUT4 "$key\t$store{$key}\n";
    }
    foreach my $chr ( sort { $a <=> $b } keys %Tally ) {
        foreach my $value1 ( sort { $a <=> $b } keys %{ $Tally{$chr} } ) {
            foreach my $value2 ( sort { $a <=> $b } keys %{ $Tally{$chr}{$value1} } ) {
                print $OUT5 "$chr\t$value1\t$value2\t$Tally{$chr}{$value1}{$value2}\n";
            }
        }
    }

When attempting to run this on larger .txt files (>4,000,000 lines), I receive the following errors (often near the end of the file, but more than 50 lines from it):

    Use of uninitialized value in subtraction (-) at line 117, <$IN2> line 4148567.
    Use of uninitialized value in numeric gt (>) at line 118, <$IN2> line 4148567.
    Use of uninitialized value $F2[2] in hash element at line 119, <$IN2> line 4148567.
    Argument "" isn't numeric in sort at line 127, <$IN2> line 4148567.

    Line 117 - $store{(abs($F2[2]-$F3[2])+1)}++;
    Line 118 - ($F2[2], $F3[2]) = ($F3[2], $F2[2]) if $F2[2] > $F3[2];
    Line 119 - $Tally{$F2[1]}{$F2[2]}{$F3[2]}++;
    Line 127 - foreach my $value1 (sort {$a <=> $b} keys %{$Tally{$chr}}) {

Printing $. confirms that the script simply terminates at this input file line and no further lines are read in. If the input file is sorted, the error occurs approximately in the same place but on a different line of content. There is nothing obviously wrong with the content of the file and all lines match expectations.

However, if I split the input file into two halves, the script runs to completion without error on each half.

Am I hitting some sort of memory or hash limit? Is there a way to fix this without having to split the input file before processing? This was run on Perl 5.25.2 but also occurs on 5.24.0.

The format of the input file is as follows:

    w 11 99658 75 75M 0
    c 11 99999 75 75M 74
    w 2 702424 75 75M 0
    c 2 702556 75 75M 74
    c 13 82486 75 75M 74
    w 13 82171 75 75M 0
    c 2 702585 75 75M 74
    w 2 702390 75 75M 0
    c 18 2529 75 75M 74
    w 18 2232 75 75M 0
    c 12 264648 75 75M 74
    w 12 264366 74 74M 0
    c 10 177758 75 75M 74
    w 10 177438 74 74M 0
    w 7 185488 74 74M 0

Replies are listed 'Best First'.
Re: Error when running on larger files
by BrowserUk (Patriarch) on Jun 28, 2016 at 08:49 UTC
    Am I hitting some sort of memory or hash limit?

    If you were hitting a memory limit, I would not expect the errors you are seeing.

    How long are the lines in the file? (A small representative sample would be good.)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice. Not understood.
      Updated OP with an example of the format. I too would not expect such errors; however, I cannot think of an alternative explanation, as it does not appear to be caused by any specific line or error in the content. After sorting or randomly shuffling the order of the lines within the input file, the error still occurs somewhere toward the end of the file.

        I just generated a 5 million line file that approximates your format:

        #! perl -slw
        use strict;

        for( 1 .. 5e6 ) {
            my $n1 = int( rand 100 );
            my $n2 = int( rand 1e6 );
            printf "w\t%d\t%d\t75\t75M\t0\n", $n1, $n2;
            printf "c\t%d\t%d\t75\t75M\t0\n", $n1, $n2 - int( rand 1000 ) + 500;
        }
        __END__
        C:\test>head 1166792.dat
        w 70 437286 75 75M 0
        c 70 437579 75 75M 0
        w 50 852386 75 75M 0
        c 50 852473 75 75M 0
        w 45 45196 75 75M 0
        c 45 45695 75 75M 0
        w 83 1739 75 75M 0
        c 83 1590 75 75M 0
        w 31 838500 75 75M 0
        c 31 838902 75 75M 0

        And wrapped your posted snippet up to allow it to run:

        #! perl -slw
        use strict;

        open my $IN2, '<', '1166792.dat' or die $!;
        my( %store, %Tally );

        while (<$IN2>) {
            chomp $_;
            next if eof;
            my @F2 = split( "\t", $_ );          #Split each tab-delimited field
            my $partner = <$IN2>;
            my @F3 = split( "\t", $partner );    #Split each tab-delimited field
            $store{ ( abs( $F2[2] - $F3[2] ) + 1 ) }++;
            ( $F2[2], $F3[2] ) = ( $F3[2], $F2[2] ) if $F2[2] > $F3[2];
            $Tally{ $F2[1] }{ $F2[2] }{ $F3[2] }++;
        }
        foreach my $key ( sort { $a <=> $b } keys %store ) {
            print "$key\t$store{$key}\n";
        }
        foreach my $chr ( sort { $a <=> $b } keys %Tally ) {
            foreach my $value1 ( sort { $a <=> $b } keys %{ $Tally{$chr} } ) {
                foreach my $value2 ( sort { $a <=> $b } keys %{ $Tally{$chr}{$value1} } ) {
                    print "$chr\t$value1\t$value2\t$Tally{$chr}{$value1}{$value2}\n";
                }
            }
        }

        And it runs to completion under 5.22 using just under 1.2GB.

        Unless you're on a very memory-constrained system, memory isn't the problem. Do you have somewhere you can post your failing data file (zipped)?



        Ignore this: That was harangzsolt33.

        You're not the guy I saw mention some time last week that he was using tinyperl?



            Use of uninitialized value in subtraction (-) at line 117, <$IN2> line 4148567.
            Use of uninitialized value in numeric gt (>) at line 118, <$IN2> line 4148567.
            Use of uninitialized value $F2[2] in hash element at line 119, <$IN2> line 4148567.
            Argument "" isn't numeric in sort at line 127, <$IN2> line 4148567.

        Instead of randomly reshuffling the file, could you not extract that 1 line (head -4148567 <file> | tail -1) and then run your script against that 1 line? If I had to guess, seeing that $F2[2] is the culprit, I'd say you have a double tab on that particular line.
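
        For anyone without head and tail to hand (on Windows, say), the same extraction can be sketched in Perl. The names line_at and show_tabs are made up for this illustration and appear nowhere in the OP's script; show_tabs just makes a doubled tab easy to see.

```perl
use strict;
use warnings;

# Hypothetical helper: return line number $wanted (1-based) of the file
# at $path with its newline removed, or undef if the file is shorter.
sub line_at {
    my ( $path, $wanted ) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    my $found;
    while ( defined( my $line = <$in> ) ) {
        next unless $. == $wanted;    # $. is the current input line number
        chomp $line;
        $found = $line;
        last;
    }
    close $in;
    return $found;
}

# Hypothetical helper: make tabs visible so that an empty field
# (two tabs in a row) stands out when the line is printed.
sub show_tabs {
    my ($line) = @_;
    ( my $shown = $line ) =~ s/\t/<TAB>/g;
    return $shown;
}
```

        Something like print show_tabs( line_at( $file, 4148567 ) ) would then display the suspect line, assuming $file holds the path to the input and the file really has that many lines.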

        Update: Never mind ... I shouldn't answer questions without first reading the whole thing, or before finishing my morning cup of coffee.

        -derby
Re: Error when running on larger files
by Cow1337killr (Monk) on Jun 28, 2016 at 09:10 UTC

    You are hitting the limit on something.

    You should be able to pinpoint the record number just prior to these errors occurring and add some print statements there, so that you can check whether the values of $F2[2] and the other variables are reasonable and then watch one or more of them suddenly go haywire.
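
    A rough sketch of that kind of check, written as a stand-alone pass over the file rather than as print statements inside the original loop. The name first_bad_pair and the rule that field 3 must be all digits are assumptions based on the sample data in the question, not something taken from the OP's script.

```perl
use strict;
use warnings;

# Hypothetical helper: read a tab-delimited file two lines at a time and
# describe the first record that looks wrong. Returns undef if every
# line has a partner, at least $min_fields fields, and a numeric field 3.
sub first_bad_pair {
    my ( $path, $min_fields ) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!";
    while ( defined( my $first = <$in> ) ) {
        my $first_no = $.;        # line number of the first of the pair
        my $second   = <$in>;     # undef if the file ends on an odd line
        if ( !defined $second ) {
            close $in;
            return "line $first_no has no partner line";
        }
        for my $rec ( [ $first_no, $first ], [ $., $second ] ) {
            my ( $no, $line ) = @$rec;
            chomp $line;
            my @f = split /\t/, $line, -1;    # -1 keeps trailing empty fields
            if ( @f < $min_fields || !defined $f[2] || $f[2] !~ /^\d+$/ ) {
                close $in;
                return "line $no looks wrong: [$line]";
            }
        }
    }
    close $in;
    return;
}
```

    Calling first_bad_pair( $file, 6 ) on the failing input, with $file standing in for its path, should either name the first suspect line or return undef. The latter would suggest the file itself is well formed as far as this check goes.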

    The good thing is that the error is reproducible.

    By the way, "Mastering Algorithms with Perl" has a detailed chapter on every sort algorithm available in Perl (at the time it was printed).
