
I have around 200m lines. I don't yet know how many lines it takes before I run out of memory, since I haven't measured it.

Re^5: tying a hash from a big dictionary
by BrowserUk (Patriarch) on Oct 31, 2011 at 14:10 UTC

    The addition of the following 3 lines should tell you with sufficient accuracy after a single run:

    sub read_dict {
        local $| = 1;                                  ##! unbuffered output
        my $file = shift;
        my %dict;
        open( my $fh, "<:encoding(utf8)", $file )
            or die "Cannot open $file: $!";
        my $c = 0;                                     ##! line counter
        while( <$fh> ) {
            printf "\r%d\t", $c unless ++$c % 1000;    ##! progress every 1000 lines
            chomp;    ## no need to chomp twice
            my( $p1, $p2 ) = split /\t/;
            push( @{ $dict{ $p1 } }, $p2 );
        }
        close $fh;
        return \%dict;    ## main space saving change; return a ref to the hash
    }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Running on a 4 GB machine, it runs out of memory after about 5m entries!

        Ouch. Those must be some long phrases, even if you're on a 32-bit OS where only 2GB or 3GB of that memory is available to the process.

        You're definitely going to need to use external storage.

        I used BerkeleyDB with some success for a while under 32-bit Perl. Though the initial build of a large DB can take quite a while, once built the access/retrieval times are about as good as I've ever seen for a disk-based system. You do need to pay some attention to the various configuration parameters to get the best out of it. Look for the BerkeleyDB tuning guide online, as the module's POD is pretty light on tuning.
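
        As a rough illustration of the tied approach, here is a minimal sketch of building the dictionary into a BerkeleyDB::Hash file instead of an in-memory hash. The file names, the sample data, the 64MB -Cachesize and the "\x1F" separator are all assumptions made for the example, not tuned recommendations. A DB value is a single string, so multiple values for one key are joined with the separator rather than pushed onto an array. The file is read as raw bytes, so decode values on retrieval if you need characters.

```perl
use strict;
use warnings;
use BerkeleyDB;

# Hypothetical names and values, for illustration only.
my $dict_file = 'dict.tsv';
my $db_file   = 'dict.bdb';
my $SEP       = "\x1F";    # assumed never to occur in the data

# Write a tiny sample input so the sketch runs on its own.
open( my $out, '>', $dict_file ) or die "Cannot write $dict_file: $!";
print {$out} "cat\tchat\n", "cat\tgato\n", "dog\tchien\n";
close $out;

unlink $db_file;    # start from an empty DB for the demo

# Tie the hash to an on-disk BerkeleyDB hash file.
tie my %dict, 'BerkeleyDB::Hash',
    -Filename  => $db_file,
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024    # assumed value; tune for your data
    or die "Cannot open $db_file: $! $BerkeleyDB::Error";

# Read raw bytes (no encoding layer) and store them as-is.
open( my $fh, '<', $dict_file ) or die "Cannot open $dict_file: $!";
while( <$fh> ) {
    chomp;
    my( $p1, $p2 ) = split /\t/;
    my $old = $dict{ $p1 };
    # Append to any existing value instead of overwriting it.
    $dict{ $p1 } = defined $old ? join( $SEP, $old, $p2 ) : $p2;
}
close $fh;

# Retrieval: split the stored string back into its values.
my @cat = split /\Q$SEP\E/, $dict{ cat };
print "cat => @cat\n";

untie %dict;
```

        Each append here is a fetch plus a store, so if the input is sorted by key it would likely be cheaper to collect all values for a key in memory and write each key once.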

        If you find yourself up against it performance-wise then the object interface is marginally quicker than the tied interface, but much less nice to use. If it is a reference-only DB, sticking (the pre-built) DB file on a cheap SSD, or even a fast thumbdrive, can do wonders for access times.
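
        For comparison, a minimal sketch of the object interface, using the module's db_put and db_get methods. The file name and the keys are made up for the example. Both methods return a status (0 on success) rather than the value, which is part of why it is less pleasant to use than the tied hash.

```perl
use strict;
use warnings;
use BerkeleyDB;

my $db_file = 'lookup.bdb';    # hypothetical file name
unlink $db_file;               # start from an empty DB for the demo

my $db = BerkeleyDB::Hash->new(
    -Filename => $db_file,
    -Flags    => DB_CREATE,
) or die "Cannot open $db_file: $! $BerkeleyDB::Error";

# db_put returns 0 on success.
$db->db_put( 'cat', 'chat' ) == 0
    or die "put failed: $BerkeleyDB::Error";

# db_get writes the value into its second argument and returns a status.
my $value;
my $status = $db->db_get( 'cat', $value );
print "cat => $value\n" if $status == 0;

# A missing key is reported through the status, not through undef.
my $missing;
my $miss_status = $db->db_get( 'no-such-key', $missing );
print "no-such-key not found\n" if $miss_status == DB_NOTFOUND;

undef $db;    # closes the database
```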

        Just wish I could get it to build for my 64-bit system :(

