
Adding the three lines marked ##! below should tell you with sufficient accuracy after a single run:

    sub read_dict {
        local $| = 1;    ##! unbuffer STDOUT so the counter appears immediately
        my $file = shift;
        my %dict;
        open( my $fh, "<:encoding(utf8)", $file )
            or die "Cannot open '$file': $!";
        my $c = 0;    ##! line counter
        while( <$fh> ) {
            printf "\r%d\t", $c unless ++$c % 1000;    ##! progress every 1000 lines
            chomp;    ## no need to chomp twice
            my( $p1, $p2 ) = split /\t/;
            push( @{ $dict{ $p1 } }, $p2 );
        }
        close $fh;
        return \%dict;    ## main space saving change; return a ref to the hash
    }
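Since read_dict() now hands back a reference rather than a copy of the hash, the calling code has to dereference it. A minimal sketch of what that might look like; the sample data here is a made-up stand-in for whatever read_dict() actually returns:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by read_dict():
# each key maps to an array ref of the values seen for that key.
my $dict = {
    hello => [ 'bonjour', 'salut' ],
    cat   => [ 'chat' ],
};

# Look up one key through the reference; the hash itself is never copied.
my @translations = @{ $dict->{hello} || [] };
print "hello => @translations\n";

# keys in scalar context gives the count without building a list.
my $n = keys %$dict;
print "$n keys loaded\n";
```

Assigning the result to a hash instead (my %d = %{ read_dict( $file ) }) would copy everything again and throw away the saving.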

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^6: tying a hash from a big dictionary
by Anonymous Monk on Oct 31, 2011 at 14:54 UTC
    Running on a 4 GB machine, it runs out of memory after about 5 million entries!

      Ouch. Those must be some long phrases, even if you're on a 32-bit OS where only 2GB or 3GB of that memory is available to the process.

      You're definitely going to need to use external storage.

      I used BerkeleyDB with some success for a while under 32-bit Perl. Though the initial build of a large DB can take quite a while, once built the access/retrieval times are about as good as I've ever seen for a disk-based system. You do need to pay some attention to the various configuration parameters to get the best out of it. Look for the BerkeleyDB tuning guide online, as the module's POD is pretty light on tuning.

      If you find yourself up against it performance-wise then the object interface is marginally quicker than the tied interface, but much less nice to use. If it is a reference-only DB, sticking (the pre-built) DB file on a cheap SSD, or even a fast thumbdrive, can do wonders for access times.
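      For comparison, here is a rough sketch of the two interfaces side by side. It assumes the BerkeleyDB module is installed. The file name, the cache size, and the idea of storing several values for a key as one tab-joined string are all illustrative guesses, not something taken from the original code:

```perl
use strict;
use warnings;
use BerkeleyDB;

my $file = 'dict.db';    # hypothetical file name

# Tied interface: reads and writes look like ordinary hash access.
tie my %dict, 'BerkeleyDB::Hash',
    -Filename  => $file,
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024    # assumed value; tune for your data
    or die "Cannot open $file: $! $BerkeleyDB::Error\n";

# Values are plain strings, so a list is stored tab-joined (an assumption).
$dict{hello} = join "\t", 'bonjour', 'salut';
my @values = split /\t/, $dict{hello};
print "hello => @values\n";
untie %dict;

# Object interface: more verbose, but skips the tie overhead.
my $db = BerkeleyDB::Hash->new(
    -Filename => $file,
    -Flags    => DB_CREATE,
) or die "Cannot open $file: $! $BerkeleyDB::Error\n";

# db_put and db_get return 0 on success.
$db->db_put( 'cat', 'chat' ) == 0
    or die "db_put failed: $BerkeleyDB::Error\n";

my $value;
if ( $db->db_get( 'cat', $value ) == 0 ) {
    print "cat => $value\n";
}
undef $db;    # closes the database
```

      For a reference-only DB you would build the file once with a loop of db_put calls and only read from it afterwards.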

      Just wish I could get it to build for my 64-bit system :(

