Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I suppose, lacking a CS background, I'm asking for trouble with a question like
this, but I might as well try:

Recently I wrote a script that compares lines in two files. I thought I'd done a
half-decent job of it until I found that a programmer at work had already done the
same thing in C, and the C program ran 75 times faster than the perl script. Oops.
(The substantial difference became apparent when the input files got fairly large.
The C program was able to go out for a nice leisurely lunch while the perl script was
still trying to figure out how to tie its shoelaces.)

Then, after reading one of the perl FAQ files referred to in a recent post here, I
re-wrote the script using hashes, not arrays. The C program remained faster than the
perl script--but the difference in execution times had been drastically reduced.

What is it about hashes that can make such a dramatic difference in
execution speed?

Replies are listed 'Best First'.
Re: Why are hashes so much faster?
by plaid (Chaplain) on Jun 18, 2000 at 01:53 UTC
    For a simple lookup at a given index, an array cannot be beaten for pure speed. I'm assuming what you were trying to do in the original script was check to see if a certain value was contained in an array. I will use a bit of terminology from lhoward's excellent post here, namely the big-O notation.

    Hashing is done by taking the key of the hash, performing a "hashing" function on this key, and obtaining a number from it. Then, this number is used as an index into an array. A good hashing function will produce an array that is fairly well populated. For instance, if your key is 'name', a function will be called with the argument of 'name'. Some pseudocode might look like

    my $random = hashing_function('name'); $array_of_hash_values[$random] = $value;
    The hashing_function is usually something that calculates a value based on the string, and performs a modulo (%) by the fixed size of the array. Then, for any subsequent lookups into the hash table, you run the key through the hashing_function again, and take whatever is at that index.
    sub lookup_value { my $index = hashing_function(shift()); return $array_of_hash_values[$index]; }
    If more than one key hashes to the same index, usually an array is built off of that array, which is why hashes are -ideally- O(1), but not necessarily.

    Array lookups are always O(1), and faster than hash lookups since they don't have to perform any hashing function beforehand. My guess is that you were having to do some traversing of the array to look for a certain value, in which case, your lookup function would be O(n), where n is the size of the array. By switching to a hash table, you were able to increase the speed of this lookup function. Other places where a hash would help would be in a delete function, as you could just get rid of the value, instead of having to worry about shifting all values down if the deleted index is in the middle of the array.

    In general, hashes are much more efficient than arrays for anything that needs to have a certain relationship between index and its value, but arrays are much faster than hashes if all you want is a list of values where the index is unimportant.

    Hope this all makes sense.

      (I'm the one who posted the question...I can't get my logins here
      to "take," and everything I post ends up under the name
      "Anonymous Monk" whether I log in or not. Sigh...)

      > My guess is that you were having to do some traversing
      > of the array to look for a certain value

      Yes, it was a case of<KBD> if ( $_ eq 'whatever' ) </KBD>while looping through the array.
      I thought perhaps it'd be faster via "grep" (don't know why I thought so)--
      it wasn't. The solution via hash was stunningly faster.

      Thanks for these replies. I don't get all of it yet, but slowly, slowly...

Re: Why are hashes so much faster?
by btrott (Parson) on Jun 18, 2000 at 01:33 UTC
    It depends on what your algorithm is. Hashes are not, by nature, faster than arrays; in certain algorithms, though, they provide much better performance.

    One particular algorithm that's much faster using hashes is doing a lookup to see if a particular item is in a list. If you use an (unsorted) array, you need to look through each item in the array to determine if you have the item in your list. So it's a O(n) algorithm, using an array. Using a hash, it's a O(1)--constant--algorithm. A hashing algorithm translates a string into an address; the hashing function should always give the same address when using the same string, so looking up a string in a hash is extremely fast and simple.

Re: Why are hashes so much faster?
by Anonymous Monk on Jun 18, 2000 at 06:44 UTC
    I wonder whether some systems implement grep internally using hashes, i.e., the first time you grep a file it gets hashed and the system stores the hash in memory for subsequent greps. In that case, my fellow Anonymous Monk's original algorithm would have been as fast as a hash implementation.
Re: Why are hashes so much faster?
by Yossarian2000 (Initiate) on Jun 19, 2000 at 18:47 UTC
    Some of the speed differences may be due to the fact that the other program was written in C. In general, compiled languages (like C) are faster than interpreted languages (like Perl).

      As it happened, changing the script to use hashes made it considerably faster--so much so that it became almost as convenient to use as the C program. Actually, it became more convenient because I'd built better error-checking into it. The C program was content to try reading a binary file--of course, that'd generate an error--whereas the perl script refused to mess with binaries.

      In that regard, at least, I judged the script "ok"; and because it was no longer grotesquely slow, I could even (cautiously) consider it "better." Of course, if the C programmer beefs up the error-checking, then his app wins again. :)

        You know, you could post the code here and solicit suggestions to speed things up even further.

        It might be a fun little project...

      Yes, yes, yes - I find that Perl is faster to WRITE, not faster to execute (on a general basis).