Re: Efficiency of a Hash with 1 Million Entries
by ikegami (Patriarch) on Jul 01, 2010 at 22:00 UTC
friendoffriend.pl runs very quickly
Yes, but it's executed 1,000,000 times.
Just loading perl takes 3ms:
$ time perl -e1
real 0m0.003s
user 0m0.004s
sys 0m0.000s
Loading a basic module triples that:
$ time perl -e'use Scalar::Util;'
real 0m0.009s
user 0m0.004s
sys 0m0.004s
For every record, you perform the following actions, all of which could be avoided if you rewrote friendoffriend.pl as a module:
- Launch a new shell
- Launch a new instance of perl
- Recompile friendoffriend.pl
- Serialize and output the result of friendoffriend.pl
- Read and parse the result of friendoffriend.pl
If that adds up to 12ms, you'd save 3 hours and 20 minutes (12ms * 1,000,000) just by converting friendoffriend.pl into a module.
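A minimal sketch of what that conversion might look like (the module and sub names here are hypothetical; the real friendoffriend.pl logic goes where indicated):

# FriendOfFriend.pm -- hypothetical module form of friendoffriend.pl
package FriendOfFriend;

use strict;
use warnings;
use Exporter 'import';
our @EXPORT_OK = ('cluster_for');

# Wrap the body of friendoffriend.pl in a sub that takes $seq as an
# argument and returns the cluster as a list instead of printing it.
sub cluster_for {
    my ($seq) = @_;
    my @cluster;
    ...;    # original friendoffriend.pl logic goes here
    return @cluster;
}

1;

The caller then compiles the module once and calls the sub a million times, with no new shells, no new perls, and no serializing:

use FriendOfFriend 'cluster_for';

# inside the per-record loop, replacing the shell-out:
my @cluster = cluster_for($seq);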
So, will this large hash significantly impact execution time?
It shouldn't. Assuming all memory accesses cost the same, inserting into and fetching from a Perl hash takes (amortized) O(1) time. In other words, the time for those operations does not grow as the hash grows.
If you use enough memory to require swapping, that would break the assumption.
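If you want to see the O(1) behaviour on your own machine, here's a quick sketch using the core Benchmark module:

use strict;
use warnings;
use Benchmark 'cmpthese';

my %small = map { $_ => 1 } 1 .. 10;
my %large = map { $_ => 1 } 1 .. 1_000_000;

# If lookups are O(1), both should run at roughly the same rate.
cmpthese( -1, {
    small => sub { my $v = $small{5} },
    large => sub { my $v = $large{500_000} },
} );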
Update: Reorganised to be clearer. Added actual numbers.
Re: Efficiency of a Hash with 1 Million Entries
by BrowserUk (Patriarch) on Jul 01, 2010 at 21:50 UTC
The first suggestion I have is that you make friendoffriend.pl into a module and call it as a subroutine rather than starting one million new copies of perl. That would probably save about half of your runtime.
Re: Efficiency of a Hash with 1 Million Entries
by jethro (Monsignor) on Jul 01, 2010 at 22:19 UTC
1) That's the beauty of a hash: a lookup takes essentially the same time whether the hash holds 10 elements or 1 million.
2) The hash stays fast as long as you have enough real memory for it (i.e. you don't have to swap memory out to disk).
Calling friendoffriend.pl for each line is the real waste of time. If you put that code inside the main script, you save the CPU cycles spent reserving memory, starting a new perl interpreter, reading and compiling the script, doing some initialisation, and throwing it all away when friendoffriend.pl exits, a million times over.
Re: Efficiency of a Hash with 1 Million Entries
by ssandv (Hermit) on Jul 01, 2010 at 21:49 UTC
You might try a run that just fetches (and maybe prints) $seq for every row. That would give you a good idea of how much of your time goes to row processing as opposed to hash lookup and storage. Personally, I suspect that fetching a million rows one at a time might be slow. Alternatively, you could profile your code with Devel::NYTProf or something similar. Hash random access is fast by design, but doing *anything* a million times adds up.
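Something like this (the DSN, credentials, and column names are guesses, since we haven't seen your connection code):

use strict;
use warnings;
use DBI;
use Time::HiRes 'time';

my $dbh = DBI->connect( 'dbi:mysql:mydb', 'user', 'password' )    # hypothetical DSN
    or die DBI->errstr;
my $sth = $dbh->prepare('SELECT seq FROM AllJoinRecip');
$sth->execute;

my $start = time;
while ( my ($seq) = $sth->fetchrow_array ) {
    # fetch only: no hash work, no call to friendoffriend.pl
}
printf "fetch-only pass: %.2fs\n", time() - $start;

For the profiling route, run the script under the profiler and then render the report:

$ perl -d:NYTProf yourscript.pl
$ nytprofhtml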
Re: Efficiency of a Hash with 1 Million Entries
by dineed (Scribe) on Jul 02, 2010 at 02:58 UTC
While others (BrowserUk, ikegami, and jethro) have offered excellent advice about the cost of calling friendoffriend.pl, I'm more curious about the query. If you are getting what appear to be duplicate results for $seq, maybe you can use "SELECT DISTINCT" in the query to reduce the output row count.
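In DBI terms that's a one-line change to the prepared statement (the column and table names here are assumed from the rest of the thread):

my $sth = $dbh->prepare('SELECT DISTINCT seq FROM AllJoinRecip');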
Re: Efficiency of a Hash with 1 Million Entries
by dineed (Scribe) on Jul 02, 2010 at 18:09 UTC
Some more about your query. I just noticed that your table name is "AllJoinRecip", which leads me to assume this table is a view (or the result of some other query) joining one or more tables/views. Things you can do to improve run/build time:
- If you can, make sure your query to build "AllJoinRecip" includes only columns you actually need in the result set
- Be careful to avoid a "Cartesian product": a query that returns every possible combination of rows from the joined tables, usually caused by a missing join condition. You can spot one when the result set contains far more rows than any of the tables/views in your "FROM" clause; the row count of a Cartesian product is the product of the source tables' row counts.
- In my earlier response I suggested using "SELECT DISTINCT", but I was speaking about the only query you showed us. Maybe you can also use "SELECT DISTINCT" when building "AllJoinRecip", to keep the data set coming into the Perl script as small as possible.
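Putting those points together, here is a hypothetical sketch of building "AllJoinRecip" lean; every table and column name below is invented for illustration, since we haven't seen your schema:

# an explicit join condition avoids a Cartesian product
$dbh->do(q{
    CREATE VIEW AllJoinRecip AS
    SELECT DISTINCT l.seq, r.seq AS friend_seq  -- only the columns you need
    FROM links l
    JOIN links r ON l.friend_id = r.id
});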
I hope this helps.
Re: Efficiency of a Hash with 1 Million Entries
by gunr (Novice) on Jul 02, 2010 at 22:11 UTC
Thanks again for the help. By turning friendoffriend.pl into a module, the execution time is now about 2 minutes! I had no idea that running another script through a system call caused such a slowdown. If anyone is looking for a really simple, easy-to-understand tutorial on making a Perl module, here is the site I used: http://www.webreference.com/programming/perl/modules/ The only thing now, though, is that the results I was getting before differ from the results I am getting now (about 224,000 clusters previously, only 124,000 now). So I will have to go back through my code and see what could have caused this change.
In regard to the AllJoinRecip table, I must have misremembered the amount of time it takes to build it. It actually takes only 10 seconds; I have no idea which table I remember taking 5 minutes to create. I am pretty sure the AllJoinRecip table already follows the suggestions you made, dineed, but I will double-check. Thanks again!
Thanks for pointing this out.
For those keeping score: It turns out that my original results file had some duplicate entries somehow. I must have messed up somewhere in the script I guess. The new results do not have any duplicates, so it is a much more accurate representation of my data.
Re: Efficiency of a Hash with 1 Million Entries
by gunr (Novice) on Jul 02, 2010 at 15:27 UTC
Thanks for the help, everyone. I am currently reading about how to turn my friendoffriend.pl script into a module. I want to keep this script as a separate file, rather than making it a subroutine within the calling program, so that I can use friendoffriend.pl on its own. If anyone knows of a good tutorial for converting Perl scripts into modules, please let me know.
In regard to the query:
friendoffriend.pl is used to cluster the data in the AllJoinRecip table: given an input $seq, it finds all of the other "nodes" that $seq can reach, either directly or indirectly. So once I've run the script on seqA and found that seqA, seqB, and seqC are all in one cluster, I do not need or want to run the script on seqB or seqC. I want to find a cluster for each seq in AllJoinRecip, without duplicating clusters, where AllJoinRecip contains a set of one-to-one links. Does this description make sense? If anyone can think of other ways to do this, I would like to hear them.
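Here is a rough sketch of the "skip anything already clustered" idea, assuming the module form of friendoffriend.pl suggested above (the names are placeholders):

my %in_cluster;
while ( my ($seq) = $sth->fetchrow_array ) {
    next if $in_cluster{$seq};    # already part of an earlier cluster
    my @cluster = FriendOfFriend::cluster_for($seq);
    $in_cluster{$_} = 1 for @cluster;    # mark every member, including $seq
    # ... record @cluster as one result ...
}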