pc2 has asked for the wisdom of the Perl Monks concerning the following question:

Salutations. We have a piece of code that uses the Storable module to store a large hash in a file, "hash.txt". Then another script retrieves the file back into a hash using the retrieve function. Here is the retrieval part:
#!c:/perl/bin/perl

use Storable;

%dict = %{retrieve("hash.txt")};   # retrieves the file into the hash %dict
the file "hash.txt" is very large (28 MB), whose hash contains 385090 pairs of keys and values. thus, the "retrieve" command takes about 7 seconds to load the file. when we tested a "hash.txt" file with 58 MB, it took even longer (nearly 16 seconds) to load the hash from the file. of course, after loading the file, obtaining a value from the hash is very fast; the problem is the loading of the hash from "hash.txt". does anyone have some idea to speed up the loading of the hash? thank you in advance.

Replies are listed 'Best First'.
Re: storable: too slow retrieval of big file.
by Corion (Patriarch) on Jul 18, 2007 at 14:54 UTC

    You can trade time spent to load the hash into memory against time spent accessing an element by using a tied hash like DB_File, or, if your hash is basically constant, CDB_File. But that will slow down every access, so you will have to benchmark whether all your hash accesses will sum up to the 7 seconds load time. You could also investigate whether simply storing the hash as Perl code via Data::Dumper and loading it via do will be faster (I doubt it), or whether using/writing a daemon like Memcached improves things.
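
    A rough sketch of the tied-hash idea with DB_File follows (the file name "dict.db" is an assumption, and the file would have to be built once from the existing data beforehand):

    use DB_File;
    use Fcntl;                              # for O_RDONLY

    # sketch only: dict.db is assumed to have been built once from the hash
    tie my %dict, 'DB_File', 'dict.db', O_RDONLY, 0644, $DB_HASH
        or die "cannot tie dict.db: $!";

    print "found\n" if exists $dict{foo};   # each access goes to disk

    untie %dict;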

Re: storable: too slow retrieval of big file.
by Fletch (Bishop) on Jul 18, 2007 at 14:55 UTC

    Sidestep the problem. Perhaps you should look into leaving the hash on disk (e.g. use BerkeleyDB, or possibly step up to a full RDBMS and DBI) rather than loading it into memory?
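
    To illustrate the DBI route, here is a minimal sketch (the SQLite file, table and column names are assumptions, and the table would need to be populated once from the original data):

    use DBI;

    # sketch only: assumes a SQLite file dict.sqlite with a table
    # dict(word TEXT PRIMARY KEY, value TEXT), populated once beforehand
    my $dbh = DBI->connect("dbi:SQLite:dbname=dict.sqlite", "", "",
                           { RaiseError => 1 });
    my $sth = $dbh->prepare("SELECT value FROM dict WHERE word = ?");

    $sth->execute("foo");                   # "foo" is just a placeholder
    my ($value) = $sth->fetchrow_array;
    print "foo => $value\n" if defined $value;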

Re: storable: too slow retrieval of big file.
by BrowserUk (Patriarch) on Jul 18, 2007 at 15:21 UTC
Re: storable: too slow retrieval of big file.
by dave_the_m (Monsignor) on Jul 18, 2007 at 15:40 UTC
    You could nearly halve the time by not dereferencing the hash (which is causing the entire hash to be copied), ie
    $dict = retrieve("hash.txt");
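
    Lookups then go through the reference instead of a plain hash, e.g. (the key is a placeholder):

    my $word = "foo";                       # placeholder key
    print $dict->{$word}, "\n" if exists $dict->{$word};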

    Dave.

Re: storable: too slow retrieval of big file.
by almut (Canon) on Jul 18, 2007 at 16:57 UTC

    Just in case you want to stick with the hash approach (the DB approach and the other suggestions would be fine, too), you could also speed things up by not reloading the hash every time. That's essentially the client/server approach I was hinting at in Re: reading dictionary file -> morphological analyser. I.e., you split the program into a server and a client. The server loads the hash once and keeps running, providing a 'dict' service via some socket. The client just connects to the socket, sends the request (the word to be looked up), and reads the server's reply.

    Here's a minimal example, just to get you started. (This is not production quality code, and could be improved in many ways (e.g. by making a proper daemon out of it, better error handling, etc.), but I wanted to keep it simple...)

    The server:

    use IO::Socket;
    use Storable;

    my $dict = retrieve("hash.txt");

    my $sock = IO::Socket::INET->new(
        LocalAddr => "localhost:8888",   # any available port you like
        ReuseAddr => 1,
        Listen    => 2,
    ) or die "$0: can't create listening socket: $!\n";

    while (1) {
        my $conn = $sock->accept();      # wait for connection
        next unless ref $conn;           # (just in case...)
        my $query = <$conn>;
        chomp $query;
        print STDERR "query: $query\n";  # just for debugging
        my $reply = (exists $dict->{$query}) ? "FOUND\n" : "NOT FOUND\n";
        print $conn $reply;
        close $conn;
    }

    The client:

    use IO::Socket;

    sub connect_server {
        my $sock = IO::Socket::INET->new(
            PeerAddr => "localhost:8888",
        ) or die "$0: can't connect: $!\n";
        return $sock;
    }

    my @inputs = qw( foo fooed fooen );

    for my $input (@inputs) {
        my $conn = connect_server();
        print $conn "$input\n";          # send the query
        my $reply = <$conn>;             # read the response
        close $conn;
        print "found '$input' in lexicon\n" if $reply eq "FOUND\n";
    }

    Even though this code is opening a new connection for every word being looked up, it's still pretty fast (around 2 ms per lookup, on my machine).

    (To play with the example, start the server in one terminal (as I mentioned, it's not a daemon, so it will stay in the foreground), and then run the client from another terminal... )

Re: storable: too slow retrieval of big file.
by wfsp (Abbot) on Jul 18, 2007 at 15:09 UTC
    Could be a job for DBM::Deep? Certainly worth a look.
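    A tiny sketch of what that could look like (the file name and key are placeholders; the DBM::Deep file would be created once and then read on demand):

    use DBM::Deep;

    # sketch only: dict.db is a placeholder file built once from the hash
    my $db = DBM::Deep->new("dict.db");

    $db->{foo} = "bar";                     # writes go straight to the file
    print $db->{foo}, "\n";                 # reads fetch only the entry needed
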
Re: storable: too slow retrieval of big file.
by pc2 (Beadle) on Jul 20, 2007 at 23:36 UTC
    Salutations. We solved the problem by switching to a database-based solution using BerkeleyDB. We converted the hash's text file to a BerkeleyDB database with the command-line program:
    db_load -c duplicates=1 -T -t hash -f dict.txt dict.db
    which converts "dict.txt" (keys and values separated by newlines, each pair of lines forming one record) into the BerkeleyDB database "dict.db", allowing duplicate keys. This solution turned out to work great. Thank you for all the help.