gunr has asked for the wisdom of the Perl Monks concerning the following question:

I have two main questions. 1) How fast/slow is a hash lookup for a hash that has more than 1 million entries and 2) Is there a better way to store the necessary information (more info on that below) than a hash? Here is the code of interest: (the DBI module is used with MySQL)
my $sth = $dbh->prepare("SELECT qseqid FROM AllJoinRecip");
$sth->execute() or die("execute failed " . $sth->errstr());
my $i = 1;
while (my $seq = $sth->fetchrow()) {
    next if exists $processed{$seq};
    my $out = `./friendoffriend.pl "$seq"`;
    my @results = split(',', $out);
    $processed{$_} = 1 foreach @results;
    print "Cluster $i: $out\n";
    $i++;
}
The output of friendoffriend.pl is a comma-separated list of strings. Basically, if I've already found the data in a previous run of friendoffriend.pl, I do not want to run friendoffriend.pl on that data again. The MySQL table AllJoinRecip contains more than 1 million records, so I need an efficient way to store and search the data I have already found. As you can see, I have been using a hash and checking whether the key exists in it.

Running this program took about 3 hours 30 minutes to process all of the data. friendoffriend.pl itself runs very quickly, so I'm wondering if using a hash to store the information is what's slowing the program down. So, will this large hash significantly impact execution time? If so, what are some alternatives that could speed it up?

I have MySQL at my disposal, so feel free to offer suggestions that utilize it. I can also modify the AllJoinRecip table as necessary, as this table can very easily be recreated (it takes about 5 minutes to create, though). Any other optimization suggestions are greatly appreciated. Thanks!

-gunr

Replies are listed 'Best First'.
Re: Efficiency of a Hash with 1 Million Entries
by ikegami (Patriarch) on Jul 01, 2010 at 22:00 UTC

    friendoffriend.pl runs very quickly

    Yes, but it's executed 1,000,000 times.

    Just loading perl takes 3ms

    $ time perl -e1
    real    0m0.003s
    user    0m0.004s
    sys     0m0.000s

    Loading a basic module triples that.

    $ time perl -e'use Scalar::Util;'
    real    0m0.009s
    user    0m0.004s
    sys     0m0.004s

    For every record, you do the following actions that could be avoided if you rewrote friendoffriend.pl as a module:

    • Launch a new shell
    • Launch a new instance of perl
    • Recompile friendoffriend.pl
    • Serialize and output the result of friendoffriend.pl
    • Read and parse the result of friendoffriend.pl

    If that adds up to 12ms, you'd save 3 hours and 20 minutes (12ms * 1,000,000) just by converting friendoffriend.pl into a module.
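    The conversion can be sketched concretely. This is a hypothetical outline only: the package name FriendOfFriend, the sub find_cluster, and its placeholder body are assumptions, not code from the thread.

```perl
# Hypothetical sketch of the module conversion; package and sub names
# (FriendOfFriend, find_cluster) are assumed, and the body is a placeholder.
package FriendOfFriend;
use strict;
use warnings;

sub find_cluster {
    my ($seq) = @_;
    # ... clustering logic formerly in friendoffriend.pl ...
    return ($seq);   # return a list directly instead of printing CSV
}

package main;

# One in-process call replaces the backtick invocation, its shell,
# its fresh perl interpreter, and the CSV serialize/parse round-trip.
my @results = FriendOfFriend::find_cluster('seqA');
print join(',', @results), "\n";
```

    In the main loop, `my @results = FriendOfFriend::find_cluster($seq);` would then stand in for the backtick call and the split.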

    So, will this large hash significantly impact execution time?

    It shouldn't. Assuming all memory access costs the same, inserting into and fetching from a Perl hash takes O(1) time. In other words, the time to do those actions does not increase as the size of the hash increases.

    If you use enough memory to require swapping, that would break the assumption.
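    The constant-time claim is easy to sanity-check. Here is a rough, self-contained benchmark sketch (sizes and key names are arbitrary): it probes each hash the same 100,000 times, so the reported time should stay roughly flat as the hash grows, provided everything fits in real memory.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Probe each hash 100,000 times and see whether the per-lookup cost
# grows with the hash size. It should stay roughly flat.
for my $size (1_000, 100_000, 1_000_000) {
    my %h = map { ("key$_" => 1) } 1 .. $size;
    my $hits = 0;
    my $t0 = time;
    for (1 .. 100_000) {
        $hits++ if exists $h{ 'key' . ($_ % $size + 1) };
    }
    printf "%9d keys: %d hits in %.3fs\n", $size, $hits, time - $t0;
}
```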

    Update: Reorganised to be clearer. Added actual numbers.

Re: Efficiency of a Hash with 1 Million Entries
by BrowserUk (Patriarch) on Jul 01, 2010 at 21:50 UTC

    The first suggestion I have is that you make friendoffriend.pl into a module and call it as a subroutine rather than starting one million new copies of perl. That would probably save about half of your runtime.

Re: Efficiency of a Hash with 1 Million Entries
by jethro (Monsignor) on Jul 01, 2010 at 22:19 UTC

    1) That's the beauty of a hash: a lookup takes the same time whether it holds 10 elements or 1 million.

    2) The hash is fast as long as you have enough real memory for it (i.e. don't have to swap out memory to disk).

    What does waste time is calling friendoffriend.pl for each line. If you put that code inside the main script, you save a lot of CPU cycles otherwise spent reserving memory, starting a new perl interpreter, reading the script, doing some initialization, and throwing it all away when friendoffriend.pl exits, repeated hundreds of thousands of times.

Re: Efficiency of a Hash with 1 Million Entries
by ssandv (Hermit) on Jul 01, 2010 at 21:49 UTC

    You might try a stripped-down loop that just fetches (and maybe prints) $seq for every row. Comparing that against the full run would give you a good idea of how much of your time goes to row fetching versus hash lookup/storage and processing. Personally, I suspect that fetching a million rows one at a time might be slow. Alternatively, you could profile your code using Devel::NYTProf or something similar. Hash random access is fast by design, but doing *anything* a million times can slow things down.

Re: Efficiency of a Hash with 1 Million Entries
by dineed (Scribe) on Jul 02, 2010 at 02:58 UTC

    While others (BrowserUk, ikegami, and jethro) have offered excellent advice about correcting issues related to friendoffriend.pl, I'm more curious about the query. If you are getting what appear to be duplicate results for $seq, maybe you can include "SELECT DISTINCT" in the query to reduce the output row count.
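    For illustration, here is the DISTINCT idea in a self-contained DBI sketch. It uses an in-memory SQLite database (via DBD::SQLite) as a stand-in, since the thread's MySQL server isn't available here; the DISTINCT clause behaves the same way in MySQL, and the sample data is made up.

```perl
use strict;
use warnings;
use DBI;

# Stand-in for the thread's MySQL table, with made-up duplicate data.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE AllJoinRecip (qseqid TEXT)');
$dbh->do("INSERT INTO AllJoinRecip (qseqid) VALUES ('seqA'), ('seqA'), ('seqB')");

# DISTINCT collapses repeated qseqid values before they reach the Perl loop.
my $all      = $dbh->selectcol_arrayref('SELECT qseqid FROM AllJoinRecip');
my $distinct = $dbh->selectcol_arrayref('SELECT DISTINCT qseqid FROM AllJoinRecip');
printf "%d rows, %d distinct\n", scalar @$all, scalar @$distinct;
```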

Re: Efficiency of a Hash with 1 Million Entries
by dineed (Scribe) on Jul 02, 2010 at 18:09 UTC

    Some more about your query. I just noticed that your table name is "AllJoinRecip", leading me to assume this table is a view (or result of some other query) joining one or more tables/views. Things you can do to improve run/build time:

    • If you can, make sure your query to build "AllJoinRecip" includes only columns you actually need in the result set
    • Be careful to avoid a "Cartesian product": a query that returns every possible combination of rows from the joined tables, i.e. the product of their row counts. A telltale sign is a result set containing far more rows than any of the tables/views in your "From" clause.
    • In my earlier response, I suggested using "SELECT DISTINCT" in your query, but I was speaking about the only query to which you provided visibility. Maybe you can use the "SELECT DISTINCT" to build "AllJoinRecip" to keep your source data set as small as possible coming into the Perl script.

    I hope this helps.

Re: Efficiency of a Hash with 1 Million Entries
by gunr (Novice) on Jul 02, 2010 at 22:11 UTC

    Thanks again for the help. By turning friendoffriend.pl into a module, the execution time is now about 2 minutes! I had no idea that running another script through a system call would cause such a slowdown. If anyone is looking for a really simple and easy to understand tutorial on making a Perl module, here is the site I used: http://www.webreference.com/programming/perl/modules/ One problem, though: the results I am getting now differ from the results I was getting before (about 224,000 clusters previously, now only 124,000). So I will have to go back through my code and see what could have caused this change.

    In regard to the AllJoinRecip table, I must have incorrectly remembered the amount of time it takes to make that table. It actually only takes 10 seconds. I have no idea which table I remember taking 5 minutes to create. I am pretty sure that the AllJoinRecip table follows the suggestions you made, dineed, but I will double check to make sure. Thanks again!

      Note that now that friendoffriend.pl runs as a module inside your other program instead of as a separate process, its global variables retain their values between calls. You might want to reinitialize them on each call.
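      A minimal sketch of the problem and the fix, using a hypothetical %seen package variable (the names are assumptions, not from the thread): when the script ran as a separate process, %seen started empty on every run; as a module, it must be reset explicitly.

```perl
package FriendOfFriend;
use strict;
use warnings;

# Hypothetical module state: a fresh process started with this empty,
# but as a module it would persist across calls without an explicit reset.
our %seen;

sub find_cluster {
    my ($seq) = @_;
    %seen = ();          # reinitialize per call, restoring old behaviour
    $seen{$seq} = 1;
    return keys %seen;
}

package main;
my @first  = FriendOfFriend::find_cluster('seqA');
my @second = FriendOfFriend::find_cluster('seqB');
print scalar(@second), "\n";   # 1, not 2: state did not leak between calls
```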

        Thanks for pointing this out.

        For those keeping score: It turns out that my original results file had some duplicate entries somehow. I must have messed up somewhere in the script I guess. The new results do not have any duplicates, so it is a much more accurate representation of my data.

Re: Efficiency of a Hash with 1 Million Entries
by gunr (Novice) on Jul 02, 2010 at 15:27 UTC
    Thanks for the help, everyone. I am currently reading about how to turn my friendoffriend.pl script into a module. I want to keep this script as a separate file, rather than making it a subroutine within the calling program, so that I can use friendoffriend.pl on its own. If anyone knows of a good tutorial for converting Perl scripts into modules, please let me know. In regard to the query: friendoffriend.pl is used to cluster data in the AllJoinRecip table, so given an input $seq, friendoffriend.pl finds all of the other "nodes" that $seq can reach, either directly or indirectly. So once I've run the script on seqA and found that seqA, seqB, and seqC are all in a cluster, then I do not need or want to run the script on seqB and seqC. I want to find a cluster for each seq in AllJoinRecip, without duplicating clusters, where AllJoinRecip contains a bunch of one-to-one links. Does this description make sense? If anyone can think of other ways to do this, I would like to hear them.

      Instead of trying to keep it as a script, make it a module and make the script a small wrapper around the module. Then you can use the same code from many places.
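        A sketch of that wrapper pattern, with hypothetical names; the package is inlined here only to keep the example self-contained, but would normally live in its own FriendOfFriend.pm file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Normally this package lives in FriendOfFriend.pm; it is inlined here
# only to keep the sketch self-contained. The names and the placeholder
# body are assumptions, not code from the thread.
package FriendOfFriend;

sub find_cluster {
    my ($seq) = @_;
    return ($seq);   # placeholder clustering logic
}

# friendoffriend.pl then shrinks to a thin wrapper, so the same code
# runs standalone from the shell or in-process from the main program.
package main;

my $seq = shift(@ARGV) // 'seqA';
print join(',', FriendOfFriend::find_cluster($seq)), "\n";
```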