Re (tilly) 1: Fast sublist generation
by tilly (Archbishop) on Jul 30, 2001 at 00:57 UTC
The fastest solution is to use DB_File to store the
hash in a Berkeley DB B-tree ($DB_BTREE). Use the OO
interface (you can get the object back from the tied hash
using tied) and position a cursor at your key by calling
the seq method with the R_CURSOR flag.
Unless you need persistence, you should use an in-memory
database.
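An untested sketch of that idea (the data is made up; DB_File is a core module but needs the Berkeley DB library installed). An undef filename gives a non-persistent, in-memory tree, and seq with R_CURSOR jumps straight to the first key greater than or equal to the search string:

```perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # $DB_BTREE, R_CURSOR, R_NEXT

# In-memory B-tree: an undef filename means no file, no persistence.
tie my %btree, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "cannot tie: $!";

$btree{$_} = 1 for qw(apple apricot banana cherry);

my $db = tied %btree;                      # the object behind the tie
my ($k, $v) = ('ap', 0);                   # seed $k with the search prefix
my $status = $db->seq($k, $v, R_CURSOR);   # cursor lands on first key >= 'ap'
while ($status == 0 && index($k, 'ap') == 0) {
    print "$k\n";                          # apple, then apricot
    $status = $db->seq($k, $v, R_NEXT);
}
```

Because the keys come back in sorted order, the walk stops as soon as a key no longer carries the prefix, so only the matching sublist is ever touched.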
Re: Fast sublist generation
by larsen (Parson) on Jul 28, 2001 at 19:57 UTC
You could avoid the regexp by using index:
if (index($key, $combine) == 0) {    # $combine is a prefix of $key
    print "Found (prefix $combine): ", substr($key, length $combine), "\n";
}
It should be more efficient than compiling and executing a regexp. For the suffix case you can combine rindex and length.
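One way to combine rindex and length for the suffix test (a sketch; the sample strings here are invented). The match counts only when the last occurrence sits flush against the end of the string:

```perl
use strict;
use warnings;

my $key     = 'unhappiness';
my $combine = 'ness';    # hypothetical suffix entry

# Suffix check: last occurrence must start exactly length difference in.
my $at = length($key) - length($combine);
if ($at >= 0 && rindex($key, $combine) == $at) {
    print "Found (suffix $combine): ", substr($key, 0, $at), "\n";   # unhappi
}
```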
Re: Fast sublist generation
by wog (Curate) on Jul 28, 2001 at 22:36 UTC
You may be able to use substr and length to accomplish
this task much faster:
chomp $key; # don't care about a trailing newline, if any
foreach my $combine (keys %$self) {
    next if length $combine > length $key;   # avoid substr reaching past the start
    if (substr($key, 0, length $combine) eq $combine) {
        print "Found (prefix $combine): ", substr($key, length $combine), "\n";
    }
    if (substr($key, -length $combine) eq $combine) {
        print "Found (suffix $combine): ",
            substr($key, 0, length($key) - length $combine), "\n";
    }
}
This should avoid the overhead of compiling a regex, and
the overhead of index (or rindex) searching in more places in the
string than are needed.
It will of course not work if you want regex-type searching
of the keys of %$self.
I think that unless your keys are already stored
in sorted order, the sorting overhead
would likely outweigh any advantage gained through a binary
search, etc.
Re: Fast sublist generation
by runrig (Abbot) on Jul 28, 2001 at 23:42 UTC
my ($prefix, $matched, $suffix) = split /(\Q$combine\E)/, $key, 2;
if (defined $matched) {
    print "$prefix found\n" if length $prefix;
    print "$suffix found\n" if defined $suffix && length $suffix;
}
Re: Fast sublist generation
by kschwab (Vicar) on Jul 28, 2001 at 19:00 UTC
Two ideas:
- The sort() is probably the most expensive thing
you are doing. With the code snippet you've provided,
there doesn't seem to be any need for it. If you really need the sort, sort the matching results rather than the input list of keys.
- Are you really doing anything with the (.*) portions
of the regexen? Reducing the regexes to /^$combine/ and
/$combine$/ would speed things up.
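A tiny sketch of the second idea ($key and $combine below are made-up stand-ins); pre-compiling with qr// and adding \Q...\E also protects against metacharacters in the search string, which the interpolated regexes in the original snippet do not:

```perl
use strict;
use warnings;

my $key     = 'vorschlagen';
my $combine = 'vor';    # hypothetical search string

# Anchored, metachar-safe, capture-free patterns.
my $as_prefix = qr/^\Q$combine\E/;
my $as_suffix = qr/\Q$combine\E$/;

if ($key =~ $as_prefix) {
    # Recover the remainder with substr instead of a (.*) capture.
    print "prefix match: ", substr($key, length $combine), "\n";   # schlagen
}
if ($key =~ $as_suffix) {
    print "suffix match\n";    # not reached for this $key
}
```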
Consider the code snippet as just a snippet, not the
full code. It actually looks something like this:
foreach my $morph (sort keys %$self) {
    if ($key =~ /^$morph(.*)$/) {
        print "Found (prefix $morph): $1\n";
        return ($self->find('DE', $1), "P:$morph");
    }
    if ($key =~ /^(.*)$morph$/) {
        print "Found (suffix $morph): $1\n";
        return ($self->find('DE', $1), "S:$morph");
    }
}
And yes, the sort wouldn't be necessary at the moment,
but if I could use the sorted order to cut the
list down to a sublist (roughly 1/30 of the search space), the
speedup would be an order of magnitude.
Ciao
Well, you could reduce the input sort list,
but at the cost of more comparisons. In this
case, only sorting keys that contain the desired
$string.
Something like:
foreach my $key (sort grep(/$string/, keys %hash)) {
    if ($key =~ /^$string(.*)$/) {
        blah;
    }
    if ($key =~ /^(.*)$string$/) {
        blah;
    }
}
This seems to speed things up for large hashes provided
the matching list is a pretty small subset of the input.
Update: Since you are return()ing the
first time you find a match, the sort() is doing more
work than you need. You really need a min() function.
There's a node that discussed various ways
to implement a min().
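One such way: List::Util's minstr returns the lexicographically smallest string in a single pass, so the full sort can be dropped entirely (the hash below is made-up toy data):

```perl
use strict;
use warnings;
use List::Util qw(minstr);

my %hash   = (ab => 1, abc => 1, cab => 1, xyz => 1);   # toy data
my $string = 'ab';

# One pass for the matches, then minstr instead of an O(n log n) sort.
my @matches = grep /\Q$string\E/, keys %hash;
my $first   = @matches ? minstr(@matches) : undef;
print "$first\n" if defined $first;    # ab
```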
Blazingly FAST
by PetaMem (Priest) on Jul 29, 2001 at 11:09 UTC
Fellow Monks,
There is a method that does the whole trick about 3
orders of magnitude faster (for the hash size I'm dealing
with, i.e. 50,000 entries) PLUS it doesn't need recursion.
I'll provide only the pseudo-code, because I got this idea
when I woke up today:
n = 1;
get the first n chars of $key;
make a hash lookup; if it exists, add it to the list of prefixes;
get the last n chars of $key;
make a hash lookup; if it exists, add it to the list of suffixes;
if n < length($key) { n++; reiterate }
That way I reduce about 50,000 regexps to about 40 hash
lookups. I LOVE Sunday mornings. :-)
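A literal Perl rendering of the pseudo-code, with a made-up toy lexicon standing in for the real %$self:

```perl
use strict;
use warnings;

# Hypothetical lexicon; in the real code this would be %$self.
my %lex = map { $_ => 1 } qw(eisen bahn er wohnung);
my $key = 'eisenbahn';

my (@prefixes, @suffixes);
for my $n (1 .. length $key) {
    my $head = substr($key, 0, $n);    # first n chars
    my $tail = substr($key, -$n);      # last n chars
    push @prefixes, $head if exists $lex{$head};
    push @suffixes, $tail if exists $lex{$tail};
}
print "prefixes: @prefixes\n";    # eisen
print "suffixes: @suffixes\n";    # bahn
```

Each candidate length costs two O(1) hash lookups, so the total work is proportional to the key length, not to the size of the lexicon.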
Ciao
This doesn't do what you want it to. Let's take a small example:
%hash = (a => 1,
ab => 1,
abc => 1,
ac => 1);
$key = 'ab';
Your first code would return ('ab', 'abc'). The code I'm replying to will return ('a', 'ab').
Let's examine the problem. You need to find all the keys that start or end with a given string. To do this, you need to get a list of the keys.
Now, once you have that list of keys, treat it as if it's just a list of strings with no duplicates. The shortest Perl code to find this would be:
my @list_of_keys = sort grep /(^$search|$search$)/o, keys %hash;
There are a few optimizations you can do (I'd think of more, but I haven't had my first pot of coffee yet):
- Remove all the keys that are shorter than your search string.
- Convert the regex to a substr. (This should take care of the keys that are too short, too.)
Putting both together yields:
my @list_of_keys = sort grep { substr($_, 0, length($search)) eq $search ||
                               substr($_, -length($search)) eq $search }
                        keys %hash;
Please note that this code is completely untested.
"Your first code would return ('ab', 'abc'). The code I'm replying to will return
('a', 'ab')."
Lucky me: that is more correct than the "first code". If you
replace $key with $word, you'll see why:
every entry in the hash is a potential prefix (or suffix,
for that matter) of $word. So yes, the blazingly fast(tm)
code does the right thing.(tm)
I encountered a second problem: simple prefix and
suffix handling is far from enough (no worry
- it was always meant as a first approximation). Languages
have all kinds of morphemes: suffixes, prefixes,
infixes, reduplication, ellipses, etc. In fact, for
a given word of n chars, the FULL segmentation would
require you to produce 2^(n-1) alternatives.
Try that with a 39-char word (Eisenbahnerwohnungsnutzungsverordnungen).
So I thought we'd reduce it to entries already present in a lexicon.
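To see where 2^(n-1) comes from: an n-char word has n-1 internal gaps, and each gap is independently either cut or not. A small sketch (function name invented) that enumerates all segmentations via a bitmask over the gaps:

```perl
use strict;
use warnings;

# Each of the n-1 gaps is a bit in $mask: set means "cut here".
sub segmentations {
    my ($word) = @_;
    my $n = length $word;
    my @all;
    for my $mask (0 .. 2**($n - 1) - 1) {
        my @parts;
        my $start = 0;
        for my $gap (0 .. $n - 2) {
            if ($mask & (1 << $gap)) {                 # cut after position $gap
                push @parts, substr($word, $start, $gap + 1 - $start);
                $start = $gap + 1;
            }
        }
        push @parts, substr($word, $start);            # final piece
        push @all, join '+', @parts;
    }
    return @all;
}

my @segs = segmentations('abc');
print "$_\n" for @segs;    # abc, a+bc, ab+c, a+b+c
```

For a 39-char word that is 2^38 alternatives, which is exactly why filtering against a lexicon is needed.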
Ciao