in reply to Fast sublist generation

Fellow Monks,

There is a method that does the whole trick about three
orders of magnitude faster (for the hash size I'm dealing
with, i.e. 50,000 entries) PLUS it doesn't need recursion.
I'll provide only the pseudo-code, because I got this idea
when I woke up today:

n = 1;
get the first n chars of $key; make a hash lookup; if it exists, put it on the list of prefixes;
get the last n chars of $key;  make a hash lookup; if it exists, put it on the list of suffixes;
if (n < length($key)) { n++; reiterate }
That way I reduce about 50,000 regexps to about 40 hash
lookups. I LOVE Sunday mornings. :-)
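
In Perl, the pseudo-code above might look like the following. Note that the sample lexicon and the sample word are made up for illustration; they are not from the actual 50,000-entry hash:

```perl
use strict;
use warnings;

# Toy lexicon and word -- assumptions for illustration only.
my %lexicon = map { $_ => 1 } qw(eisen bahn bahner wohnung wohnungen);
my $word    = 'eisenbahner';

my (@prefixes, @suffixes);
for my $n (1 .. length $word) {
    my $head = substr $word, 0, $n;    # first n chars of the word
    my $tail = substr $word, -$n;      # last n chars of the word
    push @prefixes, $head if exists $lexicon{$head};
    push @suffixes, $tail if exists $lexicon{$tail};
}

print "prefixes: @prefixes\n";   # eisen
print "suffixes: @suffixes\n";   # bahner
```

One hash lookup per length, twice per word, instead of one regexp per lexicon entry.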

Ciao

Re: Blazingly FAST (... yet oh so wrong)
by dragonchild (Archbishop) on Jul 30, 2001 at 17:46 UTC
    This doesn't do what you want it to. Let's take a small example:

    %hash = (a => 1, ab => 1, abc => 1, ac => 1);
    $key  = 'ab';
    Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab').

    Let's examine the problem. You need to find all the keys that start or end with a given string. To do this, you need to get a list of the keys.

    Now, once you have that list of keys, treat it as if it's just a list of strings with no duplicates. The shortest Perl code to find this would be:

    my @list_of_keys = sort grep /(^$search|$search$)/o, keys %hash;
    There are a few optimizations you can do (I'd think of more, but I haven't had my first pot of coffee yet):
    • Remove all the keys that are shorter than your search string.
    • Convert the regex to a substr. (This should take care of the keys that are too short, too.)
    Putting both together yields:

    my @list_of_keys = sort grep {
        substr($_, 0, length $search) eq $search
            || substr($_, -length $search) eq $search
    } keys %hash;
    Please note that this code is completely untested.
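
A runnable version of the substr filter, with the length guard from the first bullet point added (a negative-offset substr on a too-short key would otherwise warn and return undef). The sample data is the small example hash from above:

```perl
use strict;
use warnings;

# Sample data from the example above.
my %hash   = (a => 1, ab => 1, abc => 1, ac => 1);
my $search = 'ab';

my @list_of_keys = sort grep {
    # Skip keys shorter than the search string, then compare
    # the leading and trailing substrings directly.
    length($_) >= length($search)
        && (   substr($_, 0, length $search) eq $search
            || substr($_, -length $search)   eq $search )
} keys %hash;

print "@list_of_keys\n";   # ab abc
```

Note that block-form grep takes no comma between the block and the list.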
      "Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab')."

      Lucky me, that is more correct than the "first code". If you
      replace $key with $word, you'll see the reason why:

      Every entry in the hash is a potential prefix (or suffix,
      for that matter) of $word. Thus, yes, the blazingly fast(tm)
      code does the right thing.(tm)

      I encountered a second problem: simple prefix/suffix
      consideration is far from being enough (no worry, it
      was always meant as a first approximation). Various
      languages have all kinds of morphemes: suffixes, prefixes,
      infixes, reduplication, ellipses, etc. In fact, if you look
      at a given word of n chars, the FULL segmentation would
      require you to produce 2^(n-1) alternatives.

      Try that with a 39-character word (Eisenbahnerwohnungsnutzungsverordnungen).
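
The 2^(n-1) count comes from choosing, for each of the n-1 boundaries between adjacent characters, whether to cut there or not. A short Perl sketch enumerating all segmentations of a toy 4-character word:

```perl
use strict;
use warnings;

# Each bit of $mask decides whether to cut after that character
# position, so an n-char word yields 2**(n-1) segmentations.
my $word = 'abcd';
my $n    = length $word;
my @segmentations;

for my $mask (0 .. 2**($n - 1) - 1) {
    my @parts;
    my $start = 0;
    for my $i (0 .. $n - 2) {
        if ($mask & (1 << $i)) {    # cut after position $i
            push @parts, substr $word, $start, $i + 1 - $start;
            $start = $i + 1;
        }
    }
    push @parts, substr $word, $start;    # the final segment
    push @segmentations, join '|', @parts;
}

print scalar(@segmentations), " segmentations\n";   # 8 for a 4-char word
print "$_\n" for @segmentations;
```

For the 39-character word above, that is 2^38, or roughly 275 billion alternatives, which is why restricting segments to existing lexicon entries matters.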

      So I thought we reduce it to already present entries of a lexicon.

      Ciao

        Fair enough. Your original problem statement presented the situation as follows: you had a list of words and were searching to determine which words had the search string as a prefix or suffix. That is what your original code did, and what your second code did not do.

        I agree that storing a list of prefixes (in one hash) and suffixes (in another hash) is definitely preferable (if possible) and will result in a huge savings in time (and probably memory, too!). Restating the problem space is a very good thing to do when you're stuck like that.
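
One way to read the two-hash suggestion, sketched with made-up data (it spends memory at index-build time to make each later query a pair of constant-time lookups):

```perl
use strict;
use warnings;

# Build one hash keyed by every prefix and one keyed by every
# suffix of each word. The word list is a made-up example.
my @words = qw(a ab abc ac);
my (%by_prefix, %by_suffix);

for my $word (@words) {
    for my $n (1 .. length $word) {
        push @{ $by_prefix{ substr $word, 0, $n } }, $word;
        push @{ $by_suffix{ substr $word, -$n } },   $word;
    }
}

# A query is now two hash lookups plus deduplication.
my $search = 'ab';
my %seen;
my @matches = sort grep { !$seen{$_}++ }
    @{ $by_prefix{$search} || [] },
    @{ $by_suffix{$search} || [] };

print "@matches\n";   # ab abc
```
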