in reply to Fast sublist generation

Fellow Monks,

There is a method that does the whole trick about three
orders of magnitude faster (for the hash size I'm dealing
with, i.e. 50,000 entries) PLUS it doesn't need recursion.
I'll provide only the pseudo-code, because I got this idea
when I woke up today:

n = 1;
get the first n chars of $key; make a hash lookup; if it exists, put it on the list of prefixes;
get the last n chars of $key;  make a hash lookup; if it exists, put it on the list of suffixes;
if (n < length($key)) { n++; reiterate }
That way I reduce about 50,000 regexps to about 40 hash
lookups. I LOVE Sunday mornings. :-)
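
In Perl, the pseudo-code above might look like the following. Note that the sample lexicon and the sample word are made up for illustration; they are not from the actual 50,000-entry hash:

```perl
use strict;
use warnings;

# Toy lexicon and word -- assumptions for illustration only.
my %lexicon = map { $_ => 1 } qw(eisen bahn bahner wohnung wohnungen);
my $word    = 'eisenbahner';

my (@prefixes, @suffixes);
for my $n (1 .. length $word) {
    my $head = substr $word, 0, $n;    # first n chars of the word
    my $tail = substr $word, -$n;      # last n chars of the word
    push @prefixes, $head if exists $lexicon{$head};
    push @suffixes, $tail if exists $lexicon{$tail};
}

print "prefixes: @prefixes\n";   # eisen
print "suffixes: @suffixes\n";   # bahner
```

One hash lookup per length, twice per word, instead of one regexp per lexicon entry.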

Ciao

Re: Blazingly FAST (... yet oh so wrong)
by dragonchild (Archbishop) on Jul 30, 2001 at 17:46 UTC
    This doesn't do what you want it to. Let's take a small example:

    %hash = (a => 1, ab => 1, abc => 1, ac => 1);
    $key  = 'ab';
    Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab').

    Let's examine the problem. You need to find all the keys that start or end with a given string. To do this, you need to get a list of the keys.

    Now, once you have that list of keys, treat it as if it's just a list of strings with no duplicates. The shortest Perl code to find this would be:

    my @list_of_keys = sort grep /(^$search|$search$)/o, keys %hash;
    There are a few optimizations you can do (I'd think of more, but I haven't had my first pot of coffee yet):
    • Remove all the keys that are shorter than your search string.
    • Convert the regex to a substr. (This should take care of the keys that are too short, too.)
    Putting both together yields:

    my @list_of_keys = sort grep {
        substr($_, 0, length $search) eq $search
            || substr($_, -length $search) eq $search
    } keys %hash;
    Please note that this code is completely untested.
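
A runnable version of the substr filter, with the length guard from the first bullet point added (a negative-offset substr on a too-short key would otherwise warn and return undef). The sample data is the small example hash from above:

```perl
use strict;
use warnings;

# Sample data from the example above.
my %hash   = (a => 1, ab => 1, abc => 1, ac => 1);
my $search = 'ab';

my @list_of_keys = sort grep {
    # Skip keys shorter than the search string, then compare
    # the leading and trailing substrings directly.
    length($_) >= length($search)
        && (   substr($_, 0, length $search) eq $search
            || substr($_, -length $search)   eq $search )
} keys %hash;

print "@list_of_keys\n";   # ab abc
```

Note that block-form grep takes no comma between the block and the list.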
      "Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab')."

      Lucky me, that is more correct than the "first code". If you
      replace $key with $word, you'll see the reason why:

      Every entry in the hash is a potential prefix (or suffix,
      for that matter) of $word. Thus, yes, the blazingly fast(tm)
      code does the right thing.(tm)

      I encountered a second problem: simple prefix/suffix
      consideration is far from being enough (no worry, it
      was always meant as a first approximation). Various
      languages have all kinds of morphemes: suffixes, prefixes,
      infixes, reduplication, ellipses, etc. In fact, if you look
      at a given word of n chars, the FULL segmentation would
      require you to produce 2^(n-1) alternatives.

      Try that with a 39-character word (Eisenbahnerwohnungsnutzungsverordnungen).
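
The 2^(n-1) count comes from choosing, for each of the n-1 boundaries between adjacent characters, whether to cut there or not. A short Perl sketch enumerating all segmentations of a toy 4-character word:

```perl
use strict;
use warnings;

# Each bit of $mask decides whether to cut after that character
# position, so an n-char word yields 2**(n-1) segmentations.
my $word = 'abcd';
my $n    = length $word;
my @segmentations;

for my $mask (0 .. 2**($n - 1) - 1) {
    my @parts;
    my $start = 0;
    for my $i (0 .. $n - 2) {
        if ($mask & (1 << $i)) {    # cut after position $i
            push @parts, substr $word, $start, $i + 1 - $start;
            $start = $i + 1;
        }
    }
    push @parts, substr $word, $start;    # the final segment
    push @segmentations, join '|', @parts;
}

print scalar(@segmentations), " segmentations\n";   # 8 for a 4-char word
print "$_\n" for @segmentations;
```

For the 39-character word above, that is 2^38, or roughly 275 billion alternatives, which is why restricting segments to existing lexicon entries matters.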

      So I thought we reduce it to already present entries of a lexicon.

      Ciao

        Fair enough. Your original problem statement presented the situation as follows: you had a list of words and were searching to determine which words had the search string as a prefix or suffix. That is what your original code did, and what your second code did not do.

        I agree that storing a list of prefixes (in one hash) and suffixes (in another hash) is definitely preferable (if possible) and will result in a huge savings in time (and probably memory, too!). Restating the problem space is a very good thing to do when you're stuck like that.
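
One way to read the two-hash suggestion, sketched with made-up data (it spends memory at index-build time to make each later query a pair of constant-time lookups):

```perl
use strict;
use warnings;

# Build one hash keyed by every prefix and one keyed by every
# suffix of each word. The word list is a made-up example.
my @words = qw(a ab abc ac);
my (%by_prefix, %by_suffix);

for my $word (@words) {
    for my $n (1 .. length $word) {
        push @{ $by_prefix{ substr $word, 0, $n } }, $word;
        push @{ $by_suffix{ substr $word, -$n } },   $word;
    }
}

# A query is now two hash lookups plus deduplication.
my $search = 'ab';
my %seen;
my @matches = sort grep { !$seen{$_}++ }
    @{ $by_prefix{$search} || [] },
    @{ $by_suffix{$search} || [] };

print "@matches\n";   # ab abc
```
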