demerphq, you are wrong.
How long were your "words"? Less than 25 characters?
For the two-miss scenario, matching each of 100,000 25-character needles against each of 30,000 x 1000-char strings requires:
326 * 100,000 * 976 * 30,000 comparisons = 954,528,000,000,000 comparisons.
Your test performed 30,000,000 * 1275 = 38,250,000,000 comparisons.
That puts your rate of comparisons at 21,250,000/second.
At that rate, your runtime for the full 100,000 x 30,000 x 1000 problem is:
1 year, 5 months, 4 days, 21 hours, 29 minutes, 24.7 seconds.
For the 3-miss scenario the numbers are:
2626 * 100,000 * 976 * 30,000 = 7,688,928,000,000,000.
Your runtime for the full 100,000 x 30,000 x 1000 problem is:
11 years, 5 months, 20 days, 2 hours, 51 minutes, 45.88 seconds.
For the 4-miss scenario the numbers are:
28,252 * 100,000 * 976 * 30,000 = 82,721,856,000,000,000.
Your runtime for the full 100,000 x 30,000 x 1000 problem is:
123 years, 4 months, 9 days, 17 hours, 27 minutes, 3.46 seconds.
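The arithmetic behind these three estimates can be reproduced with a short sketch. It takes the per-offset factors (326, 2626, 28,252) and the rate of 21,250,000 comparisons/second exactly as quoted above; the constant and function names are illustrative only, and 976 is simply 1000 - 25 + 1, the number of alignments of a 25-char needle in a 1000-char string.

```python
# Figures quoted in the post; names are hypothetical, chosen for illustration.
NEEDLES = 100_000          # 25-character search strings
SEQUENCES = 30_000         # 1000-character strings
OFFSETS = 1000 - 25 + 1    # 976 alignments of a needle within one string
RATE = 21_250_000          # comparisons per second, as quoted

# Per-offset factors for each scenario, taken as given from the post.
FACTORS = {2: 326, 3: 2626, 4: 28_252}


def total_comparisons(factor):
    """Comparisons needed for the full needles x offsets x sequences problem."""
    return factor * NEEDLES * OFFSETS * SEQUENCES


def runtime_seconds(factor):
    """Projected runtime at the quoted comparison rate."""
    return total_comparisons(factor) / RATE


for misses, factor in FACTORS.items():
    secs = runtime_seconds(factor)
    years = secs / (365 * 86_400)  # plain 365-day years, no calendar handling
    print(f"{misses}-miss: {total_comparisons(factor):,} comparisons, "
          f"{secs:,.2f} s (~{years:.1f} years)")
```

The year figure uses a flat 365-day year, so it only approximates the calendar-style breakdowns given above.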
As for your challenge: I have published my code. Where is yours? Try applying your method to real-world data.
If you have somewhere I can post the 1000 x 1000-char randomly generated sequences (994 KB) and the 1000 x 25-char randomly generated search strings (27 KB), I will send them to you so that you can time the process of producing the required information:
sequence no / offset / fuzziness (number of mismatched characters), which my published test code produces in 28.5 minutes.
Then, and only then, when we are comparing eggs with eggs, will there be any point in continuing this discussion.
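For anyone wanting to check their own output against the required sequence no / offset / fuzziness triples, here is a minimal brute-force sketch. It is not the published test code referred to above, just a straightforward sliding-window mismatch count; the function names, the `ACGT` alphabet, and the small data sizes are all assumptions made for illustration.

```python
import random


def mismatches(needle, window, limit):
    """Count mismatched characters, stopping once the count exceeds limit."""
    count = 0
    for a, b in zip(needle, window):
        if a != b:
            count += 1
            if count > limit:
                break
    return count


def fuzzy_search(sequences, needle, max_misses):
    """Yield (sequence_no, offset, fuzziness) for each alignment of needle
    that has at most max_misses mismatched characters."""
    n = len(needle)
    for seq_no, seq in enumerate(sequences):
        for offset in range(len(seq) - n + 1):
            fuzz = mismatches(needle, seq[offset:offset + n], max_misses)
            if fuzz <= max_misses:
                yield seq_no, offset, fuzz


# Small demo on randomly generated data (alphabet and sizes are assumptions).
rng = random.Random(42)
alphabet = "ACGT"
sequences = ["".join(rng.choice(alphabet) for _ in range(1000)) for _ in range(10)]
# Take the needle from a known position so that at least one exact hit exists.
needle = sequences[3][100:125]

for seq_no, offset, fuzz in fuzzy_search(sequences, needle, 2):
    print(f"{seq_no}/{offset}/{fuzz}")
```

Slow as it is, a reference implementation like this makes it easy to confirm that two faster programs report the same hits before comparing their timings.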
In reply to Re^3: Fuzzy Searching: Optimizing Algorithm Selection
by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection
by Itatsumaki