The problem: in a (perhaps misguided) attempt to take strain off my DBMS, I have a hash whose keys are labels and whose values are numerical IDs. I want to get from my textual data to the numerical IDs.

That way, Perl does the work instead of the DBMS. The alternative is to look up the label in the DB on each pass through the loop, which runs around 3 million times. I haven't benchmarked, but I don't think caching will help much; besides, the system running the Perl script has 4 GB of RAM and it's *much* faster than the DB server.

This is a backend script that parses an Apache access log and inserts the data into the DB.

Anyhow ... the labels are location names, complete with trailing slashes. Following that is a little ID that tells me which virtual host the location lives on. The values, as mentioned above, are the IDs in the database corresponding to each location. Thus, the keys / values look something like this:

    /foo/-17                  => 56
    /rastapopulous/woohoo/-19 => 67
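To make the key layout concrete, here's a minimal sketch of how such a hash could be populated and queried. The two entries are the samples above; the `location_key` helper is a made-up name, purely for illustration.

```perl
use strict;
use warnings;

# Keys are "<location>-<vhost id>"; values are the location's ID in the DB.
my %locations = (
    '/foo/-17'                  => 56,
    '/rastapopulous/woohoo/-19' => 67,
);

# Hypothetical helper: build the composite key from its two parts.
sub location_key {
    my ($loc_string, $vhost_id) = @_;
    return "$loc_string-$vhost_id";
}

print $locations{ location_key('/foo/', 17) }, "\n";
```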

The data I'm trying to mangle (lines in an Apache logfile) doesn't always cooperate -- the trailing slash is not always present if someone requests the "/foo/" location (e.g. the request string looks like "www.foo.bar.net/foo") so my sub that takes the request string and returns the ID currently looks like this:

    sub get_location_id {
        my ($loc_string, $vhost_id) = @_;

        # %locations is a global hash, the one that holds all the values
        return $locations{"$loc_string-$vhost_id"}
            if defined $locations{"$loc_string-$vhost_id"};

        # OK, maybe we didn't find the whole thing, but we still
        # want to know what the top level of the tree was,
        # i.e. /foo/bar/ may not be in the lookup hash, but
        # if /foo/ is we want to log it
        # (the leading /? lets this match strings that start with a slash)
        my ($chopped_loc) = ($loc_string =~ m#^(/?[^/]+/?)#);

        if (defined $chopped_loc) {
            return $locations{"$chopped_loc-$vhost_id"}
                if defined $locations{"$chopped_loc-$vhost_id"};

            # OK, that handles it if the trailing slash is present
            # in the data ...
            return $locations{"$chopped_loc/-$vhost_id"}
                if defined $locations{"$chopped_loc/-$vhost_id"};
        }

        # log it to the catch-all "other" location if
        # we've fallen through this far
        return $locations{"other-$vhost_id"};
    }
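To spell out the lookup order the sub is meant to follow, here's a self-contained sketch that lists the candidate keys and returns the first one that hits. The names `candidate_keys` and `first_hit`, the `other-17` entry and its ID are all made up for illustration, and the regex assumes location strings start with a slash, as the sample keys do.

```perl
use strict;
use warnings;

# Made-up sample data; the "other" entry and its ID are assumptions.
my %locations = (
    '/foo/-17'                  => 56,
    '/rastapopulous/woohoo/-19' => 67,
    'other-17'                  => 1,
);

# Candidate keys, in the order they should be tried.
sub candidate_keys {
    my ($loc_string, $vhost_id) = @_;
    my @keys = ("$loc_string-$vhost_id");           # exact match first
    if (my ($top) = $loc_string =~ m#^(/?[^/]+/?)#) {
        push @keys, "$top-$vhost_id";               # top level as captured
        push @keys, "$top/-$vhost_id";              # top level plus a slash
    }
    return (@keys, "other-$vhost_id");              # catch-all last
}

# Return the ID for the first candidate key present in %locations.
sub first_hit {
    my ($loc_string, $vhost_id) = @_;
    for my $key (candidate_keys($loc_string, $vhost_id)) {
        return $locations{$key} if defined $locations{$key};
    }
    return;    # not even an "other" entry for this vhost
}

print first_hit('/foo/',     17), "\n";    # exact match
print first_hit('/foo/bar/', 17), "\n";    # falls back to /foo/
print first_hit('/foo',      17), "\n";    # missing trailing slash
print first_hit('/nope/',    17), "\n";    # catch-all
```

With this sample data, the first three calls all resolve to the `/foo/` entry and the last one lands on the catch-all.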

I'm *sure* it's inelegant. I wouldn't be surprised to find out it's inefficient. How might I improve it?


Philosophy can be made out of anything. Or less -- Jerry A. Fodor


In reply to Optimizing hash lookups by arturo
