PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks,

I have a hash and need to look whether a combination of
its keys matches a searchstring. The thingie looks like that:

foreach my $combine (sort keys %$self) {
    if ($key =~ /^$combine(.*)$/) {
        print "Found (prefix $combine): $1\n";
    }
    if ($key =~ /^(.*)$combine$/) {
        print "Found (suffix $combine): $1\n";
    }
}
This works pretty well, but is pretty slow (big hash). Now
I see that much efficiency could be gained by using the sort
and just iterating through the sublist of keys that share the
same first char as the searchstring ($key).
Unfortunately I cannot figure out how to reduce the
sort keys %$self list without copying and
fiddling around. Also, I suspect that the regexps are way
too primitive and expensive for this simple task.

Has anyone a helping advice for me? Thank you very much.

Ciao

Replies are listed 'Best First'.
Re (tilly) 1: Fast sublist generation
by tilly (Archbishop) on Jul 30, 2001 at 00:57 UTC
    The fastest solution is to use DB_File to store the hash with Berkeley DB in a B_TREE. Then use the OO interface (you can get the object back from the tied hash using tied) and locate your key with the seq method and the R_CURSOR flag, which positions the cursor at the first key greater than or equal to the one you give it.

    Unless you need persistence, you should use an in-memory database.
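
    For reference, here is a minimal, untested sketch of that approach using an in-memory B-tree. The seq/R_CURSOR/R_NEXT calls are the documented DB_File OO interface; the first-char bucketing condition is just one way to cut the search space, as the original question asked:

    ```perl
    use DB_File;
    use Fcntl;

    # In-memory Berkeley DB B-tree: keys come back in sorted order.
    tie my %tree, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "tie failed: $!";
    %tree = %$self;                     # copy the hash into the B-tree

    my $db = tied %tree;
    my $first = substr($key, 0, 1);
    my ($k, $v) = ($first, '');
    # Position the cursor at the first key ge $first, then walk
    # forward while the keys still share that first character.
    for (my $st = $db->seq($k, $v, R_CURSOR);
         $st == 0 && substr($k, 0, 1) eq $first;
         $st = $db->seq($k, $v, R_NEXT)) {
        # $k is a candidate key sharing $key's first character
    }
    ```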

Re: Fast sublist generation
by larsen (Parson) on Jul 28, 2001 at 19:57 UTC
    You could avoid the regexps by using index:

    if (index($key, $combine) == 0) {
        print "Found (prefix $combine): ", substr($key, length $combine), "\n";
    }

    It should be more efficient than compiling and executing a regexp. For the suffix code you could combine index and length.
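
    For the suffix case, one possible combination of rindex and length (an untested sketch) is:

    ```perl
    # Untested sketch: $combine is a suffix of $key iff it occurs
    # exactly at position length($key) - length($combine).
    my $pos = length($key) - length($combine);
    if ($pos >= 0 && rindex($key, $combine) == $pos) {
        print "Found (suffix $combine): ", substr($key, 0, $pos), "\n";
    }
    ```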

Re: Fast sublist generation
by wog (Curate) on Jul 28, 2001 at 22:36 UTC
    You may be able to use substr and length to accomplish this task much faster:

    chomp $key; # don't care about trailing newline, if any
    foreach my $combine (keys %$self) {
        if (substr($key, 0, length $combine) eq $combine) {
            print "Found (prefix $combine): ",
                  substr($key, length $combine), "\n";
        }
        if (substr($key, -length $combine) eq $combine) {
            print "Found (suffix $combine): ",
                  substr($key, 0, length($key) - length $combine), "\n";
        }
    }

    This should avoid the overhead of compiling a regex, and the overhead of index (or rindex) searching in more places in the string than are needed.

    It will of course not work if you want regex-type searching of the keys of %$self.

    I think that unless your keys are already stored in sorted order, the sorting overhead would likely outweigh any advantage gained through a binary search, etc.
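
    To make the trade-off concrete, a binary search over a pre-sorted key list might look like the untested sketch below; building and maintaining that sorted list is exactly where the overhead goes:

    ```perl
    # Find the index of the first key ge $prefix in a sorted array ref.
    sub lower_bound {
        my ($sorted, $prefix) = @_;
        my ($lo, $hi) = (0, scalar @$sorted);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if ($sorted->[$mid] lt $prefix) { $lo = $mid + 1 }
            else                            { $hi = $mid }
        }
        return $lo;
    }

    # From that index, scan forward only while keys still match $prefix.
    my @sorted = sort keys %$self;
    my $i = lower_bound(\@sorted, $prefix);
    $i++ while $i < @sorted
            && substr($sorted[$i], 0, length $prefix) eq $prefix;
    ```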

Re: Fast sublist generation
by runrig (Abbot) on Jul 28, 2001 at 23:42 UTC
    Another idea:
    my ($prefix, $matched, $suffix) = split /($combine)/, $key, 2;
    if ($matched) {
        print "$prefix found\n" if $prefix;
        print "$suffix found\n" if $suffix;
    }
Re: Fast sublist generation
by kschwab (Vicar) on Jul 28, 2001 at 19:00 UTC
    Two ideas:

    1. The sort() is probably the most expensive thing you are doing. With the code snippet you've provided, there doesn't seem to be any need for it. If you really need the sort, sort the matching results rather than the input list of keys.
    2. Are you really doing anything with the (.*) portions of the regexen? Reducing the regex to /^$combine/ would speed it up (and the other one to /$combine$/).
      Consider the code snippet to be just that, a snippet, not the
      full code. Actually it looks something like this:
      foreach my $morph (sort keys %$self) {
          if ($key =~ /^$morph(.*)$/) {
              print "Found (prefix $morph): $1\n";
              return ($self->find('DE', $1), "P:$morph");
          }
          if ($key =~ /^(.*)$morph$/) {
              print "Found (suffix $morph): $1\n";
              return ($self->find('DE', $1), "S:$morph");
          }
      }
      And yes, the sort wouldn't be necessary at the moment,
      but if I could manage to utilize sort by cutting the sorted
      list down to a sublist (roughly 1/30 of the search space), then
      the speedup would be an order of magnitude.

      Ciao

        Well, you could reduce the input sort list, but at the cost of more comparisons. In this case, only sorting keys that contain the desired $string. Something like:
        foreach my $key (sort grep(/$string/, keys %hash)) {
            if ($key =~ /^$string(.*)$/) { blah; }
            if ($key =~ /^(.*)$string$/) { blah; }
        }
        This seems to speed things up for large hashes provided the matching list is a pretty small subset of the input.

        Update: Since you are return()ing the first time you find a match, the sort() is doing more work than you need. You really need a min() function. There's a node that discussed various ways to implement a min().
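
        Such a min() could be sketched like this (untested, assuming $string and %hash as above): it finds the lexicographically smallest matching key in a single O(n) pass instead of an O(n log n) sort.

        ```perl
        my $min;
        for my $k (grep /$string/, keys %hash) {
            $min = $k if !defined($min) || $k lt $min;
        }
        # $min now holds the smallest matching key, or undef if none
        ```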

Blazingly FAST
by PetaMem (Priest) on Jul 29, 2001 at 11:09 UTC
    Fellow Monks,

    There is a method that does the whole trick about 3
    orders of magnitude faster (for the hash size I'm dealing
    with, i.e. 50,000 entries) PLUS it doesn't need recursion.
    I'll provide only the pseudo-code, because I got this idea
    when I woke up today:

    my (@prefixes, @suffixes);
    for my $n (1 .. length($key)) {
        my $head = substr($key, 0, $n);            # first n chars
        push @prefixes, $head if exists $self->{$head};
        my $tail = substr($key, -$n);              # last n chars
        push @suffixes, $tail if exists $self->{$tail};
    }
    That way I reduce about 50,000 regexp matches to about 40 hash
    lookups. I LOVE Sunday mornings. :-)

    Ciao

      This doesn't do what you want it to. Let's take a small example:

      %hash = (a => 1, ab => 1, abc => 1, ac => 1);
      $key  = 'ab';
      Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab').

      Let's examine the problem. You need to find all the keys that start or end with a given string. To do this, you need to get a list of the keys.

      Now, once you have that list of keys, treat it as if it's just a list of strings with no duplicates. The shortest Perl code to find this would be:

      my @list_of_keys = sort grep /(^$search|$search$)/o, keys %hash;
      There are a few optimizations you can do (I'd think of more, but I haven't had my first pot of coffee yet):
      • Remove all the keys that are shorter than your search string.
      • Convert the regex to a substr. (This should take care of the keys that are too short, too.)
      Putting both together yields:

      my @list_of_keys = sort grep {
          substr($_, 0, length($search)) eq $search
          || substr($_, -length($search)) eq $search
      } keys %hash;
      Please note that this code is completely untested.
        "Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab')."

        Lucky me, that is more correct than the "first code". If you
        replace $key with $word, you'll see the reason why:

        Every entry in the hash is a potential prefix (or suffix,
        for that matter) of $word. Thus yes, the blazingly fast(tm)
        code does the right thing.(tm)

        I encountered a second problem: simple prefix/suffix
        consideration is far from being enough (no worry
        - it was always meant as a first approximation). But across
        various languages there are all kinds of morphemes, suffixes, prefixes,
        infixes, reduplication, ellipses etc. In fact, if you look
        at a given word of n chars, the FULL segmentation would
        require you to produce 2^(n-1) alternatives.

        Try that with a 39char word (Eisenbahnerwohnungsnutzungsverordnungen)
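
        The 2^(n-1) figure comes from the n-1 gaps between adjacent characters, each of which can independently be a cut point or not. A small illustrative sketch (untested, using a 4-char toy word, so 2^3 = 8 segmentations):

        ```perl
        my $word = "abcd";
        my $gaps = length($word) - 1;
        for my $mask (0 .. (1 << $gaps) - 1) {
            my @parts;
            my $start = 0;
            for my $i (0 .. $gaps - 1) {
                if ($mask & (1 << $i)) {       # cut after position $i
                    push @parts, substr($word, $start, $i + 1 - $start);
                    $start = $i + 1;
                }
            }
            push @parts, substr($word, $start); # the final piece
            print join('+', @parts), "\n";
        }
        ```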

        So I thought we reduce it to already present entries of a lexicon.

        Ciao