in reply to Large file, multi dimensional hash - out of memory

You could reduce the memory requirement to around 1/4 by not using 2 levels of hash. A single level will do the job:

use strict;
use warnings;

open( my $fh, "<", "input.txt" ) or die "cannot open < input.txt: $!";

my %duplicates;
while ( my $line = <$fh> ) {
    chomp $line;
    ++$duplicates{ $line };
}
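
If what you ultimately want is the de-duplicated data, the distinct lines are just the hash keys once the loop has finished, and anything seen more than once has a count above 1. A minimal follow-on sketch, continuing from the loop above and assuming the output should simply go to STDOUT:

# emit each distinct line once (sorted, since hash keys have no order)
print "$_\n" for sort keys %duplicates;

# or report only the lines that occurred more than once
print "$_ ($duplicates{$_} occurrences)\n"
    for grep { $duplicates{$_} > 1 } sort keys %duplicates;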

But that will still require around 4GB to build the 50e6-key hash. Better than 16GB, but you will still run out of memory if you are using a 32-bit Perl (unless you have a very high proportion of duplicates, e.g. >50%).
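
If you want to sanity-check that estimate on your own build of perl, the CPAN module Devel::Size (an assumption that it is installed) can measure a sample hash directly; the synthetic keys below are only a stand-in for your real lines:

use strict;
use warnings;
use Devel::Size qw(total_size);    # CPAN module -- assumed to be installed

# build a 1e6-key sample hash and scale the measurement up to 50e6 keys
my %sample;
$sample{ sprintf( "line_%07d", $_ ) } = 1 for 1 .. 1_000_000;

my $bytes = total_size( \%sample );
printf "1e6 keys: %.0f MB => roughly %.1f GB for 50e6 keys\n",
    $bytes / 2**20, $bytes * 50 / 2**30;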

As you say the data is presorted, investigate the uniq command.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Large file, multi dimensional hash - out of memory
by Anonymous Monk on May 15, 2013 at 14:53 UTC
    Thanks, it is a 64-bit Perl, but not long after passing the 4GB mark it runs out of memory anyway. I guess we need a bigger boat.

      As the input file is sorted, uniq infile > outfile ought to do the job very quickly.

      If for some reason you don't have the uniq command available, try

      #! perl -w
      use strict;

      my $last = <>;    # first line of the sorted input

      while( <> ) {
          # emit a line only when the next line differs from it
          print $last if $_ ne $last;
          $last = $_;
      }
      print $last if defined $last;    # don't drop the final group

      __END__
      thisscript infile > outfile
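
      The same filter also fits on one line if that is more convenient. A sketch, assuming a Unix-like shell and a genuinely sorted input; it keeps the first line of each run, so no end-of-input fix-up is needed:

      perl -e 'my $last = ""; while (<>) { print if $_ ne $last; $last = $_ }' infile > outfile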
