in reply to Re^2: Fuzzy Searching: Optimizing Algorithm Selection
in thread Fuzzy Searching: Optimizing Algorithm Selection

demerphq, you are wrong.

How long were your "words"? Less than 25 characters?

For the two-miss scenario, matching each of 100,000 25-character needles against each of 30,000 1000-char strings, at each of the 976 (= 1000 - 25 + 1) possible start offsets, requires:

326 * 100,000 * 976 * 30,000 = 954,528,000,000,000 comparisons.

Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

Your rate of comparisons is 21,250,000/second.

Your runtime for the 100,000 x 30,000 x 1000 problem would be:

1 year, 5 months, 4 days, 21 hours, 29 minutes, 24.7 seconds.

For the 3-miss scenario the numbers are:

2626 * 100,000 * 976 * 30,000 = 7,688,928,000,000,000 comparisons.

Your runtime for the 100,000 x 30,000 x 1000 problem would be:

11 years, 5 months, 20 days, 2 hours, 51 minutes, 45.88 seconds.

For the 4-miss scenario the numbers are:

28,252 * 100,000 * 976 * 30,000 = 82,721,856,000,000,000 comparisons.

Your runtime for the 100,000 x 30,000 x 1000 problem would be:

123 years, 4 months, 9 days, 17 hours, 27 minutes, 3.46 seconds.
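
For anyone who wants to check the arithmetic, here is a quick Perl sketch. (Illustrative only: it takes the comparison counts above and the 21,250,000/second rate, and reports total seconds and approximate 365-day years rather than the full calendar breakdowns.)

    use strict;
    use warnings;

    my $rate = 21_250_000;    # comparisons/second, as derived above

    # [ misses, fuzzy-variant count per 25-char needle ]
    for my $s ( [ 2, 326 ], [ 3, 2_626 ], [ 4, 28_252 ] ) {
        my ( $misses, $variants ) = @$s;
        # 100,000 needles x 976 start offsets x 30,000 1000-char strings
        my $comparisons = $variants * 100_000 * 976 * 30_000;
        my $seconds     = $comparisons / $rate;
        printf "%d-miss: %.0f comparisons => %.0f s (~%.1f years)\n",
            $misses, $comparisons, $seconds, $seconds / ( 365 * 24 * 3600 );
    }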

As for your challenge: I have published my code. Where is yours? Try applying your method to real-world data.

If you have somewhere I can post the 1000 x 1000-char randomly generated sequences (994 KB) and the 1000 x 25-char randomly generated search strings (27 KB), then I will send them to you so that you can time the process of producing the required output:

Sequence no. / offset / fuzziness (number of mismatched characters). My published test code produces this in 28.5 minutes.

Then, and only then, when we are comparing eggs with eggs, will there be any point in continuing this discussion.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^4: Fuzzy Searching: Optimizing Algorithm Selection
by demerphq (Chancellor) on Nov 12, 2004 at 21:19 UTC

    I'm not at liberty to post my code, sorry. All I can say is what I have already said: it is an Inline::C implementation of a Patricia tree optimized for handling a 10-character alphabet.

    Regarding all your bollocks in this reply: first off, I was very clear in my node that I misunderstood the question and that my proposals only handled finding exact matches of the fuzzy word list, not finding substrings. Since that is what you insulted and rubbished me about, that is what I am talking about.

    I have already stated that I ran my tests against live data (35 million telephone numbers) and that it took 0.00006 seconds per telephone number, or 1 million lookups per minute, or the full 35 million in just over 30 minutes. (This includes the overhead of reading the file and splitting the records on tabs.)
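
    (To spell the arithmetic out: 1 / 0.00006 s is roughly 16,667 lookups per second, which is about 1,000,000 per minute, and 35,000,000 lookups at that rate is 35 minutes.)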

    Now, comparing your algorithm against mine will mean that you only have to test whether there is a fuzzy match from the first character. So if you want to provide a subroutine that I can run over my test data set, then we are in business. But I'm not going to help you write it, and in fact I'm not going to do any more than run it and see. As it is, I should be working on speeding up the site, not sparring with someone who doesn't treat me with the decency I afford him.

    ---
    demerphq

Re^4: Fuzzy Searching: Optimizing Algorithm Selection
by demerphq (Chancellor) on Nov 12, 2004 at 23:48 UTC

    Your math is all wrong.

    Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

    I used an 11-digit key to generate my fuzzy list of 1275 words (I took a phone book, picked a number, and generated all the one- and two-digit variants, then searched against the book; the book held just under 35,000,000 numbers). Since the number of comparisons in a Trie lookup is at most the number of characters in the matched key, the absolute maximum number of comparisons used by the Trie-based algorithm is 11 * N, where N is the number of words in the dictionary. Thus the absolute maximum for searching a 35-million-word dictionary is 385 million character comparisons; in practice it will be significantly less because of early bailouts.

    Another cool thing about the trie approach is that it can do fuzzy matches against a whole host of strings simultaneously for exactly the same work. The OP said he needs to run a large set of these matches: he could generate his fuzzy list for each word, walk the dictionary, and match the full set in one go.
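
    Obviously I can't post the real code, but the shape of the idea is easy enough to sketch in pure Perl. (Illustrative only: the real implementation is an Inline::C Patricia tree; this plain hash-of-hashes trie and the stand-in data are just to show the comparison bound and the simultaneous matching.)

        use strict;
        use warnings;

        # One trie holds the fuzzy variants of *all* search words, so a
        # single pass over the dictionary matches every word at once.
        # Each lookup costs at most length($key) hash probes -- 11 for
        # an 11-digit key -- with early bailout on the first mismatch.
        my %trie;

        sub add_variant {
            my ($word) = @_;
            my $node = \%trie;
            $node = $node->{$_} ||= {} for split //, $word;
            $node->{MATCH} = $word;    # multi-char key cannot collide
        }

        sub lookup {                   # exact, start-anchored match
            my ($key) = @_;
            my $node = \%trie;
            for my $ch ( split //, $key ) {
                $node = $node->{$ch} or return;    # early bailout
            }
            return $node->{MATCH};
        }

        # Stand-in data: really the variant list would hold the one-
        # and two-digit fuzzy variants of every needle.
        my @fuzzy_variants = qw( 12345678901 12345678902 12345678911 );
        my @phone_book     = qw( 12345678902 99999999999 12345678911 );

        add_variant($_) for @fuzzy_variants;
        for my $number (@phone_book) {
            my $hit = lookup($number);
            print "$number matched $hit\n" if defined $hit;
        }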

    All of this assumes the thing I got wrong in the first place: that the match is anchored at the start point. I wasn't paying attention, and when I realized it during my resubmit preview I added the bit starting "gah". Since I thought the ideas raised were interesting, and the qualifier was present, I didn't think it worth editing further. Maybe next time I'll put in huge blinking letters "OOPS -- I ANSWERED THE WRONG QUESTION" so it's more clear.

    ---
    demerphq

      So what you're saying is that my estimate of the performance of your Trie implementation is wildly optimistic.

      Therefore the number of comparisons/second you achieved is less than the 21,250,000 figure I used as the divisor in the rest of my calculations. (385,000,000 character comparisons in roughly half an hour is on the order of 200,000 comparisons/second, around a hundredth of that figure.)

      Ergo: your proposal would take even longer to run.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon