PetaMem has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow Monks,

I have a hash and need to look whether a combination of
its keys matches a searchstring. The thingie looks like that:

foreach my $combine (sort keys %$self) {
    if ($key =~ /^$combine(.*)$/) {
        print "Found (prefix $combine): $1\n";
    }
    if ($key =~ /^(.*)$combine$/) {
        print "Found (suffix $combine): $1\n";
    }
}
This works pretty well, but is pretty slow (big hash). Now
I see that much efficiency could be gained by using the sort
and just iterating through the sublist of keys that share the
same first char as the searchstring ($key).
Unfortunately I cannot figure out how to reduce the
sort keys %$self list without copying and
fiddling around. Also, I suspect that the regexps are way
too primitive and expensive for this simple task.

Has anyone a helping advice for me? Thank you very much.

Ciao

Replies are listed 'Best First'.
Re (tilly) 1: Fast sublist generation
by tilly (Archbishop) on Jul 30, 2001 at 00:57 UTC
    The fastest solution is to use DB_File to store the hash with Berkeley DB in a B_TREE. Then use the OO interface (you can get the object back from the tied hash using tied) and locate your key with the seq method and the R_CURSOR flag, which positions the cursor at the first key greater than or equal to the one you give it.

    Unless you need persistence, you should use an in-memory database.
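
    For reference, here is a minimal, untested sketch of that approach using an in-memory B-tree. The seq/R_CURSOR/R_NEXT calls are the documented DB_File OO interface; the first-char bucketing condition is just one way to cut the search space, as the original question asked:

    ```perl
    use DB_File;
    use Fcntl;

    # In-memory Berkeley DB B-tree: keys come back in sorted order.
    tie my %tree, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "tie failed: $!";
    %tree = %$self;                     # copy the hash into the B-tree

    my $db = tied %tree;
    my $first = substr($key, 0, 1);
    my ($k, $v) = ($first, '');
    # Position the cursor at the first key ge $first, then walk
    # forward while the keys still share that first character.
    for (my $st = $db->seq($k, $v, R_CURSOR);
         $st == 0 && substr($k, 0, 1) eq $first;
         $st = $db->seq($k, $v, R_NEXT)) {
        # $k is a candidate key sharing $key's first character
    }
    ```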

Re: Fast sublist generation
by larsen (Parson) on Jul 28, 2001 at 19:57 UTC
    You could avoid the regexps by using index:

    if (index($key, $combine) == 0) {
        print "Found (prefix $combine): ", substr($key, length $combine), "\n";
    }

    It should be more efficient than compiling and executing a regexp. For the suffix code you could combine index and length.
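
    For the suffix case, one possible combination of rindex and length (an untested sketch) is:

    ```perl
    # Untested sketch: $combine is a suffix of $key iff it occurs
    # exactly at position length($key) - length($combine).
    my $pos = length($key) - length($combine);
    if ($pos >= 0 && rindex($key, $combine) == $pos) {
        print "Found (suffix $combine): ", substr($key, 0, $pos), "\n";
    }
    ```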

Re: Fast sublist generation
by wog (Curate) on Jul 28, 2001 at 22:36 UTC
    You may be able to use substr and length to accomplish this task much faster:

    chomp $key; # don't care about trailing newline, if any
    foreach my $combine (keys %$self) {
        if (substr($key, 0, length $combine) eq $combine) {
            print "Found (prefix $combine): ",
                  substr($key, length $combine), "\n";
        }
        if (substr($key, -length $combine) eq $combine) {
            print "Found (suffix $combine): ",
                  substr($key, 0, length($key) - length $combine), "\n";
        }
    }

    This should avoid the overhead of compiling a regex, and the overhead of index (or rindex) searching in more places in the string than are needed.

    It will of course not work if you want regex-type searching of the keys of %$self.

    I think that unless your keys are already stored in sorted order, the sorting overhead would likely outweigh any advantage gained through a binary search, etc.
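
    To make the trade-off concrete, a binary search over a pre-sorted key list might look like the untested sketch below; building and maintaining that sorted list is exactly where the overhead goes:

    ```perl
    # Find the index of the first key ge $prefix in a sorted array ref.
    sub lower_bound {
        my ($sorted, $prefix) = @_;
        my ($lo, $hi) = (0, scalar @$sorted);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if ($sorted->[$mid] lt $prefix) { $lo = $mid + 1 }
            else                            { $hi = $mid }
        }
        return $lo;
    }

    # From that index, scan forward only while keys still match $prefix.
    my @sorted = sort keys %$self;
    my $i = lower_bound(\@sorted, $prefix);
    $i++ while $i < @sorted
            && substr($sorted[$i], 0, length $prefix) eq $prefix;
    ```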

Re: Fast sublist generation
by runrig (Abbot) on Jul 28, 2001 at 23:42 UTC
    Another idea:
    my ($prefix, $matched, $suffix) = split /($combine)/, $key, 2;
    if ($matched) {
        print "$prefix found\n" if $prefix;
        print "$suffix found\n" if $suffix;
    }
Re: Fast sublist generation
by kschwab (Vicar) on Jul 28, 2001 at 19:00 UTC
    Two ideas:

    1. The sort() is probably the most expensive thing you are doing. With the code snippet you've provided, there doesn't seem to be any need for it. If you really need the sort, sort the matching results rather than the input list of keys.
    2. Are you really doing anything with the (.*) portions of the regexen? Reducing the regex to /^$combine/ would speed it up (and the other one to /$combine$/).
      Consider the code snippet to be just that, a snippet, not the
      full code. Actually it looks something like this:
      foreach my $morph (sort keys %$self) {
          if ($key =~ /^$morph(.*)$/) {
              print "Found (prefix $morph): $1\n";
              return ($self->find('DE', $1), "P:$morph");
          }
          if ($key =~ /^(.*)$morph$/) {
              print "Found (suffix $morph): $1\n";
              return ($self->find('DE', $1), "S:$morph");
          }
      }
      And yes, the sort wouldn't be necessary at the moment,
      but if I could manage to utilize sort by cutting the sorted
      list down to a sublist (roughly 1/30 of the search space), then
      the speedup would be an order of magnitude.

      Ciao

        Well, you could reduce the input sort list, but at the cost of more comparisons. In this case, only sorting keys that contain the desired $string. Something like:
        foreach my $key (sort grep(/$string/, keys %hash)) {
            if ($key =~ /^$string(.*)$/) { blah; }
            if ($key =~ /^(.*)$string$/) { blah; }
        }
        This seems to speed things up for large hashes provided the matching list is a pretty small subset of the input.

        Update: Since you are return()ing the first time you find a match, the sort() is doing more work than you need. You really need a min() function. There's a node that discussed various ways to implement a min().
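
        Such a min() could be sketched like this (untested, assuming $string and %hash as above): it finds the lexicographically smallest matching key in a single O(n) pass instead of an O(n log n) sort.

        ```perl
        my $min;
        for my $k (grep /$string/, keys %hash) {
            $min = $k if !defined($min) || $k lt $min;
        }
        # $min now holds the smallest matching key, or undef if none
        ```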

Blazingly FAST
by PetaMem (Priest) on Jul 29, 2001 at 11:09 UTC
    Fellow Monks,

    There is a method that does the whole trick about 3
    orders of magnitude faster (for the hash size I'm dealing
    with, i.e. 50,000 entries) PLUS it doesn't need recursion.
    I'll provide only the pseudo-code, because I got this idea
    when I woke up today:

    my (@prefixes, @suffixes);
    for my $n (1 .. length($key)) {
        my $head = substr($key, 0, $n);            # first n chars
        push @prefixes, $head if exists $self->{$head};
        my $tail = substr($key, -$n);              # last n chars
        push @suffixes, $tail if exists $self->{$tail};
    }
    That way I reduce about 50,000 regexp matches to about 40 hash
    lookups. I LOVE Sunday mornings. :-)

    Ciao

      This doesn't do what you want it to. Let's take a small example:

      %hash = (a => 1, ab => 1, abc => 1, ac => 1);
      $key  = 'ab';
      Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab').

      Let's examine the problem. You need to find all the keys that start or end with a given string. To do this, you need to get a list of the keys.

      Now, once you have that list of keys, treat it as if it's just a list of strings with no duplicates. The shortest Perl code to find this would be:

      my @list_of_keys = sort grep /(^$search|$search$)/o, keys %hash;
      There are a few optimizations you can do (I'd think of more, but I haven't had my first pot of coffee yet):
      • Remove all the keys that are shorter than your search string.
      • Convert the regex to a substr. (This should take care of the keys that are too short, too.)
      Putting both together yields:

      my @list_of_keys = sort grep {
          substr($_, 0, length($search)) eq $search
          || substr($_, -length($search)) eq $search
      } keys %hash;
      Please note that this code is completely untested.
        "Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab')."

        Lucky me, that is more correct than the "first code". If you
        replace $key with $word, you'll see the reason why:

        Every entry in the hash is a potential prefix (or suffix,
        for that matter) of $word. Thus yes, the blazingly fast(tm)
        code does the right thing.(tm)

        I encountered a second problem: simple prefix/suffix
        consideration is far from being enough (no worry
        - it was always meant as a first approximation). But across
        various languages there are all kinds of morphemes, suffixes, prefixes,
        infixes, reduplication, ellipses etc. In fact, if you look
        at a given word of n chars, the FULL segmentation would
        require you to produce 2^(n-1) alternatives.

        Try that with a 39char word (Eisenbahnerwohnungsnutzungsverordnungen)
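
        The 2^(n-1) figure comes from the n-1 gaps between adjacent characters, each of which can independently be a cut point or not. A small illustrative sketch (untested, using a 4-char toy word, so 2^3 = 8 segmentations):

        ```perl
        my $word = "abcd";
        my $gaps = length($word) - 1;
        for my $mask (0 .. (1 << $gaps) - 1) {
            my @parts;
            my $start = 0;
            for my $i (0 .. $gaps - 1) {
                if ($mask & (1 << $i)) {       # cut after position $i
                    push @parts, substr($word, $start, $i + 1 - $start);
                    $start = $i + 1;
                }
            }
            push @parts, substr($word, $start); # the final piece
            print join('+', @parts), "\n";
        }
        ```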

        So I thought we reduce it to already present entries of a lexicon.

        Ciao