Optimizing hash lookups

arturo has asked for the wisdom of the Perl Monks concerning the following question:

The problem: in a (perhaps misguided) attempt to take strain off my DBMS, I have a hash whose keys are labels and whose values are numerical IDs. I want to get from my textual data to the numerical IDs.

That way, Perl does the work instead of the DBMS. The alternative is to look up the label in the DB on each iteration of a loop that runs around 3 million times. I haven't benchmarked, but I don't think caching will help much; besides, the system running the Perl script has 4 GB and is *much* faster than the DB server.

This is a backend script that parses an Apache access log and inserts the data into the DB.

Anyhow ... the labels are location names, complete with trailing slashes. Each label is followed by a hyphen and a little ID that tells me which virtual host the location lives on. The values, as mentioned above, are the IDs in the database corresponding to each location. Thus, the keys / values look something like this:

    /foo/-17 => 56
    /rastapopulous/woohoo/-19 => 67

The data I'm trying to mangle (lines in an Apache logfile) doesn't always cooperate -- the trailing slash is not always present if someone requests the "/foo/" location (e.g. the request string looks like "www.foo.bar.net/foo") so my sub that takes the request string and returns the ID currently looks like this:

sub get_location_id {
    my ($loc_string, $vhost_id) = @_;
    # %locations is a global hash, the one that holds all the values
    return $locations{"$loc_string-$vhost_id"} if 
        defined($locations{"$loc_string-$vhost_id"});
    # OK, maybe we didn't find the whole thing, but we still
    # want to know what the top level of the tree was, 
    # i.e. /foo/bar/ may not be in the lookup hash, but
    # if /foo/ is we want to log it
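    # capture the leading slash plus the first path segment
    # (assumes $loc_string starts with '/', as the hash keys do)
    my ($chopped_loc) = ($loc_string =~ m#^(/[^/]+/?)#);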
    return $locations{"$chopped_loc-$vhost_id"} if 
        defined $locations{"$chopped_loc-$vhost_id"};
    # OK, that handles it if the trailing slash is present
    # in the data ...
    return $locations{"$chopped_loc/-$vhost_id"} if
        defined $locations{"$chopped_loc/-$vhost_id"};

    # log it to the catch-all "other" location if 
    # we've fallen through this far
    return $locations{"other-$vhost_id"};
}

I'm *sure* it's inelegant. I wouldn't be surprised to find out it's inefficient. How might I improve it?

I have a large file whic

Philosophy can be made out of anything. Or less -- Jerry A. Fodor


Re: Optimizing hash lookups by Fastolfe (Vicar) on Oct 31, 2000 at 01:18 UTC
Not sure if this affects your algorithm, but typically if you have a directory set up at `/foo/` and someone requests `/foo`, the server will redirect the browser to `/foo/`, so your access logs will have a hit for each. You should also use exists instead of defined for hash keys. Your code also has a few errors. Not sure if you just copied this by hand or if this is the actual code. I'm going to assume the former, but I'll give you a list if it's the latter.
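To illustrate the exists-versus-defined point, here is a minimal sketch. The sample hash and the helper name `key_status` are hypothetical, made up for the example; one key is deliberately stored with an undef value so the two tests disagree.

```perl
use strict;
use warnings;

# Hypothetical sample data: '/bar/-17' is present but has no value.
my %locations = (
    '/foo/-17' => 56,
    '/bar/-17' => undef,
);

# Returns (exists?, defined?) for a key, each as 1 or 0.
sub key_status {
    my ($key) = @_;
    return (
        exists  $locations{$key} ? 1 : 0,
        defined $locations{$key} ? 1 : 0,
    );
}

for my $key ('/foo/-17', '/bar/-17', '/baz/-17') {
    my ($exists, $defined) = key_status($key);
    printf "%-10s exists=%d defined=%d\n", $key, $exists, $defined;
}
```

The key with the undef value passes `exists` but fails `defined`, while the missing key fails both. If every stored ID is a real number the two tests agree, but `exists` states the intent ("is this key in the hash?") directly.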
Re: Optimizing hash lookups by arturo (Vicar) on Oct 31, 2000 at 01:21 UTC
Oh, my, and it's not even correct. I should be checking the first time for a missing '/' ... and in fact, maybe I could solve the whole dang problem simply by force-appending a '/' to the $loc_string before passing it to the routine (but that's expensive too, and I don't want to do it because files are valid locations). I still think there's some theoretical interest here, though.
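One way to combine these ideas, sketched under assumptions rather than offered as the fix: try the string as given (so files still match), then the same string with a '/' appended, then the top-level location, and only then fall back to "other". The sample data, including the IDs for the "other" keys, is hypothetical, and the sketch assumes $loc_string always starts with '/'.

```perl
use strict;
use warnings;

# Hypothetical sample data in the "location-vhost_id" key format.
my %locations = (
    '/foo/-17'                  => 56,
    '/rastapopulous/woohoo/-19' => 67,
    'other-17'                  => 1,
    'other-19'                  => 2,
);

sub get_location_id {
    my ($loc_string, $vhost_id) = @_;

    # First candidate: the location exactly as logged (covers files).
    my @candidates = ($loc_string);

    # Second: the same location with a trailing slash, if it lacks one.
    push @candidates, "$loc_string/" unless $loc_string =~ m{/\z};

    # Third: the top level of the tree, e.g. "/foo/bar/baz" -> "/foo/".
    if ($loc_string =~ m{^(/[^/]+)}) {
        push @candidates, "$1/";
    }

    for my $candidate (@candidates) {
        my $key = "$candidate-$vhost_id";
        return $locations{$key} if exists $locations{$key};
    }

    # Nothing matched: use the catch-all location for this vhost.
    return $locations{"other-$vhost_id"};
}

print get_location_id('/foo', 17), "\n";
print get_location_id('/foo/bar/', 17), "\n";
```

Each call costs at most four hash lookups and one regex match, which should be cheap next to a database round trip, though only a benchmark on the real log would confirm that.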