pc2 has asked for the wisdom of the Perl Monks concerning the following question:

Salutations. We have a piece of code that uses the Storable module to store a large hash in a file, "hash.txt". Then another script retrieves the file back into a hash using the retrieve function. Here is the retrieval part:
#!c:/perl/bin/perl

use Storable;

%dict = %{retrieve("hash.txt")};   # retrieves the file into the hash %dict
the file "hash.txt" is very large (28 MB), whose hash contains 385090 pairs of keys and values. thus, the "retrieve" command takes about 7 seconds to load the file. when we tested a "hash.txt" file with 58 MB, it took even longer (nearly 16 seconds) to load the hash from the file. of course, after loading the file, obtaining a value from the hash is very fast; the problem is the loading of the hash from "hash.txt". does anyone have some idea to speed up the loading of the hash? thank you in advance.

Replies are listed 'Best First'.
Re: storable: too slow retrieval of big file.
by Corion (Patriarch) on Jul 18, 2007 at 14:54 UTC

    You can trade time spent to load the hash into memory against time spent accessing an element by using a tied hash like DB_File, or, if your hash is basically constant, CDB_File. But that will slow down every access, so you will have to benchmark whether all your hash accesses will sum up to the 7 seconds load time. You could also investigate whether simply storing the hash as Perl code via Data::Dumper and loading it via do will be faster (I doubt it), or whether using/writing a daemon like Memcached improves things.
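
    A rough sketch of the tied-hash idea with DB_File follows (the file name "dict.db" is an assumption, and the file would have to be built once from the existing data beforehand):

    use DB_File;
    use Fcntl;                              # for O_RDONLY

    # sketch only: dict.db is assumed to have been built once from the hash
    tie my %dict, 'DB_File', 'dict.db', O_RDONLY, 0644, $DB_HASH
        or die "cannot tie dict.db: $!";

    print "found\n" if exists $dict{foo};   # each access goes to disk

    untie %dict;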

Re: storable: too slow retrieval of big file.
by Fletch (Bishop) on Jul 18, 2007 at 14:55 UTC

    Sidestep the problem. Perhaps you should look into leaving the hash on disk (e.g. use BerkeleyDB, or possibly step up to a full RDBMS and DBI) rather than loading it into memory?
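
    To illustrate the DBI route, here is a minimal sketch (the SQLite file, table and column names are assumptions, and the table would need to be populated once from the original data):

    use DBI;

    # sketch only: assumes a SQLite file dict.sqlite with a table
    # dict(word TEXT PRIMARY KEY, value TEXT), populated once beforehand
    my $dbh = DBI->connect("dbi:SQLite:dbname=dict.sqlite", "", "",
                           { RaiseError => 1 });
    my $sth = $dbh->prepare("SELECT value FROM dict WHERE word = ?");

    $sth->execute("foo");                   # "foo" is just a placeholder
    my ($value) = $sth->fetchrow_array;
    print "foo => $value\n" if defined $value;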

Re: storable: too slow retrieval of big file.
by BrowserUk (Patriarch) on Jul 18, 2007 at 15:21 UTC
Re: storable: too slow retrieval of big file.
by dave_the_m (Monsignor) on Jul 18, 2007 at 15:40 UTC
    You could nearly halve the time by not dereferencing the hash (which is causing the entire hash to be copied), ie
    $dict = retrieve("hash.txt");
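
    Lookups then go through the reference instead of a plain hash, e.g. (the key is a placeholder):

    my $word = "foo";                       # placeholder key
    print $dict->{$word}, "\n" if exists $dict->{$word};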

    Dave.

Re: storable: too slow retrieval of big file.
by almut (Canon) on Jul 18, 2007 at 16:57 UTC

    Just in case you want to stick with the hash approach (the DB approach and the other suggestions would be fine, too), you could also speed things up by not reloading the hash every time. That's essentially the client/server approach I was hinting at in Re: reading dictionary file -> morphological analyser. I.e., you split the program into a server and a client. The server loads the hash once and keeps running, providing a 'dict' service via some socket. The client just connects to the socket, sends the request (the word to be looked up), and reads the server's reply.

    Here's a minimal example, just to get you started. (This is not production quality code, and could be improved in many ways (e.g. by making a proper daemon out of it, better error handling, etc.), but I wanted to keep it simple...)

    The server:

    use IO::Socket;
    use Storable;

    my $dict = retrieve("hash.txt");

    my $sock = IO::Socket::INET->new(
        LocalAddr => "localhost:8888",   # any available port you like
        ReuseAddr => 1,
        Listen    => 2,
    ) or die "$0: can't create listening socket: $!\n";

    while (1) {
        my $conn = $sock->accept();      # wait for connection
        next unless ref $conn;           # (just in case...)
        my $query = <$conn>;
        chomp $query;
        print STDERR "query: $query\n";  # just for debugging
        my $reply = (exists $dict->{$query}) ? "FOUND\n" : "NOT FOUND\n";
        print $conn $reply;
        close $conn;
    }

    The client:

    use IO::Socket;

    sub connect_server {
        my $sock = IO::Socket::INET->new(
            PeerAddr => "localhost:8888",
        ) or die "$0: can't connect: $!\n";
        return $sock;
    }

    my @inputs = qw( foo fooed fooen );

    for my $input (@inputs) {
        my $conn = connect_server();
        print $conn "$input\n";          # send the query
        my $reply = <$conn>;             # read the response
        close $conn;
        print "found '$input' in lexicon\n" if $reply eq "FOUND\n";
    }

    Even though this code is opening a new connection for every word being looked up, it's still pretty fast (around 2 ms per lookup, on my machine).

    (To play with the example, start the server in one terminal (as I mentioned, it's not a daemon, so it will stay in the foreground), and then run the client from another terminal... )

Re: storable: too slow retrieval of big file.
by wfsp (Abbot) on Jul 18, 2007 at 15:09 UTC
    Could be a job for DBM::Deep? Certainly worth a look.
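    A tiny sketch of what that could look like (the file name and key are placeholders; the DBM::Deep file would be created once and then read on demand):

    use DBM::Deep;

    # sketch only: dict.db is a placeholder file built once from the hash
    my $db = DBM::Deep->new("dict.db");

    $db->{foo} = "bar";                     # writes go straight to the file
    print $db->{foo}, "\n";                 # reads fetch only the entry needed
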
Re: storable: too slow retrieval of big file.
by pc2 (Beadle) on Jul 20, 2007 at 23:36 UTC
    Salutations. We solved the problem by switching to a database-based solution using BerkeleyDB. We converted the hash's text file to a BerkeleyDB database with the command-line program:
    db_load -c duplicates=1 -T -t hash -f dict.txt dict.db
    which converts "dict.txt" (keys and values separated by newlines, each pair of lines forming one record) into the BerkeleyDB database "dict.db", allowing duplicate keys. This solution turned out to work great. Thank you for all the help.