in reply to Re^3: Fuzzy Searching: Optimizing Algorithm ( A few improvements).
in thread Fuzzy Searching: Optimizing Algorithm Selection
... this code fails the "A" x 10/"A" x 11 test...
Yes. As I pointed out to ysth offline a couple of days ago, if the sequences are not an exact multiple of the key length, then the code produces some extra, erroneous matches.
I've update the post and code above to note and correct that.
The math you do to correct this is the same as I had, but your positioning of it (in the if condition) is not. Once the math determines that the end of the sequence has been reached, there is no point in allowing the loop to continue. It simply produces further bad matches and wastes time. The conditional last statement in the corrected code above does this.
A few further comments re the test harness.
As coded above, the algorithm recalculates various pieces of information derived from the length of the keys, on each iteration of the key-processing loop. If the keys will be invarient length, and the other algorithms make this assumption, then that code need only be executed once, rather 100,000 scalar @keys times, and so should be lifted out of the for my $keys ( @keys )... loop.
However, returning the results as a list, rather than by reference, seems a retrograde step as this implies replication of the results array at least once, with all the memory allocation and timecosts that implies, which from my breif purusal of the test harness, will be incorporated into the timings.
Probably pretty even for all algorithms on any given run, but implying a disproportionate overhead on the quicker algorithms on any high density tests.
Whilst I have had the same results from randomly generated sequences as ysth reported: that matches are sparsely populated in these randomly generated sequences. In a generic fuzzy-match application unrelated to the OPs problem, this may not be the case. So any comparisons should also include high density datasets also.
My algorithm does particularly well in this regard relative to others, so this is me watching out for number one here, but I think a generic solution should cater for this scenario..
I agree that further investigations of this should be done in a new thread ,divorced from association with some parts of this one. I'll let you start it and I'll post an, adapted to your testharness, version of the latest incarnation of my code there, once I see what if any changes occur in the test harness that I need to adapt it to.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^5: Fuzzy Searching: Optimizing Algorithm ( A few improvements).
by demerphq (Chancellor) on Dec 09, 2004 at 09:00 UTC | |
by ysth (Canon) on Dec 09, 2004 at 11:42 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 12:44 UTC | |
by BrowserUk (Patriarch) on Dec 09, 2004 at 10:11 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 12:33 UTC | |
by BrowserUk (Patriarch) on Dec 09, 2004 at 12:41 UTC | |
by demerphq (Chancellor) on Dec 09, 2004 at 12:56 UTC |