in reply to Large file, multi dimensional hash - out of memory

You could reduce the memory requirement to around 1/4 by not using 2 levels of hash. A single level will do the job:

use strict;
use warnings;

open( my $fh, "<", "input.txt" ) or die "cannot open < input.txt: $!";

my %duplicates;
while ( my $line = <$fh> ) {
    chomp $line;
    ++$duplicates{ $line };
}
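
If what you ultimately want is the de-duplicated data, the distinct lines are just the hash keys once the loop has finished, and anything seen more than once has a count above 1. A minimal follow-on sketch, continuing from the loop above and assuming the output should simply go to STDOUT:

# emit each distinct line once (sorted, since hash keys have no order)
print "$_\n" for sort keys %duplicates;

# or report only the lines that occurred more than once
print "$_ ($duplicates{$_} occurrences)\n"
    for grep { $duplicates{$_} > 1 } sort keys %duplicates;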

But that will still require around 4GB to build the 50e6-key hash. Better than 16GB, but you will still run out of memory if you are using a 32-bit Perl (unless you have a very high proportion of duplicates, e.g. >50%).
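
If you want to sanity-check that estimate on your own build of perl, the CPAN module Devel::Size (an assumption that it is installed) can measure a sample hash directly; the synthetic keys below are only a stand-in for your real lines:

use strict;
use warnings;
use Devel::Size qw(total_size);    # CPAN module -- assumed to be installed

# build a 1e6-key sample hash and scale the measurement up to 50e6 keys
my %sample;
$sample{ sprintf( "line_%07d", $_ ) } = 1 for 1 .. 1_000_000;

my $bytes = total_size( \%sample );
printf "1e6 keys: %.0f MB => roughly %.1f GB for 50e6 keys\n",
    $bytes / 2**20, $bytes * 50 / 2**30;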

As you say the data is presorted, investigate the uniq command.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Large file, multi dimensional hash - out of memory
by Anonymous Monk on May 15, 2013 at 14:53 UTC
    Thanks, it is a 64-bit Perl, but not long after passing the 4GB mark it runs out of memory anyway. I guess we need a bigger boat.

      As the input file is sorted, uniq infile > outfile ought to do the job very quickly.

      If for some reason you don't have the uniq command available, try

      #! perl -w
      use strict;

      my $last = <>;    # first line of the sorted input

      while( <> ) {
          # emit a line only when the next line differs from it
          print $last if $_ ne $last;
          $last = $_;
      }
      print $last if defined $last;    # don't drop the final group

      __END__
      thisscript infile > outfile
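
      The same filter also fits on one line if that is more convenient. A sketch, assuming a Unix-like shell and a genuinely sorted input; it keeps the first line of each run, so no end-of-input fix-up is needed:

      perl -e 'my $last = ""; while (<>) { print if $_ ne $last; $last = $_ }' infile > outfile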
