princepawn has asked for the wisdom of the Perl Monks concerning the following question:
The goal is, for each line in A, to return the best match from B using a subroutine named fuzzy_match, a function that takes two strings and returns a float from 0 to 1.
Now, let's assume that file B is enormous, making the prospect of applying fuzzy_match to each member infeasible. But let's also assume that the first character of each member of B will always be the best result from fuzzy_match for A. This means that instead of looking through all of B, you simply need to retrieve all records from B which start with the same first letter as the current record in A.
Hence you only search a "bucket" of B instead of all of B, saving you a good bit of time.
Now, I could write such an indexing / bucketing routine myself pretty easily, but I'm surprised that such a routine has not already been written. However CPAN showed no results for such a beast... any leads?
So, the best interface would be useable under a variety of hashing strategies... so the desired interface would be along the lines of:
my $hash = Data::Bucket->new; $hash->reflect->addSlots( hashing_strategy => sub { my $self = shift; my $string = shift; return substr(0, 2, $string) ; } ); while (<$search_file>) { $hash->store($_); } # then later for (<$in_file>) { my @bucket = $hash->bucket_for($_); my $best_match = fuzzy_match($_, @bucket); .. do something with best match ... }
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: optimizing a linear search by Indexed or bucketed hashing
by GrandFather (Saint) on Oct 04, 2007 at 23:30 UTC | |
|
Re: optimizing a linear search by Indexed or bucketed hashing
by Cop (Initiate) on Oct 05, 2007 at 01:47 UTC | |
by GrandFather (Saint) on Oct 05, 2007 at 01:50 UTC | |
by Cop (Initiate) on Oct 05, 2007 at 03:46 UTC |