in reply to Re^5: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.)(Yes. Really!)
in thread Fuzzy Searching: Optimizing Algorithm Selection
> The Xor algorithm does handle "several hundred thousand keys" easily.
No, the Xor algorithm does NOT handle several hundred thousand keys easily. The growth rate of its run time when searching a 100k-character string is too high: the algorithm gets slower the more keys are involved. So slow, in fact, that you could construct a Trie/DFA for each subset of the keys, run them in series, and still outperform your algorithm. I chose the numbers I chose because if your algorithm blows up at those numbers, it has already totally lost the game.
Look at the numbers. For 10 words of 25 chars each, searched over the 1000 strings of 100k chars, my solution took 5.25 minutes; yours took half an hour. You could run mine nearly six times over before yours finished, so mine would get through roughly 60 words in the 30 minutes yours spends struggling with 10. Adding words appears to make your solution's run time grow faster than mine's.
Now, mine could be made even more efficient. As I said, the construction routine I'm using is far from optimal; I'm fairly sure it can be made to run in time linear in the number of keys. Not only that, but your memory calculations are like your other calculations: misleading. An optimized representation for this problem should require only 20 bytes per node (5 * length(long)). By comparison, a single empty hash plus the reference to it (of which there are hundreds of thousands in use in my prototype code) takes more memory than that, at least double, since an SV by itself takes 28 bytes. So the scale of problem this solution can handle in production-quality code is much, much larger than with the toy implementation here.
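To make that concrete, here is a rough sketch of what such a packed representation might look like. This is illustrative, not my prototype code: I am assuming a four-letter DNA alphabet and packing each node as five native longs (four child offsets plus a key id).

```perl
#!/usr/bin/perl
# Rough sketch of a packed trie: each node is 5 native longs
# (4 child offsets for A/C/G/T plus a key id, -1 meaning "no key").
# With 32-bit longs that is the 20 bytes per node quoted above.
use strict;
use warnings;

my $LONG = length pack 'l!', 0;        # bytes in a native long
my $NODE = 5 * $LONG;                  # bytes per node
my %slot = ( A => 0, C => 1, G => 2, T => 3 );
my $buf  = pack 'l!5', 0, 0, 0, 0, -1; # node 0 is the root

sub add_key {
    my ( $key, $id ) = @_;
    my $node = 0;
    for my $ch ( split //, uc $key ) {
        my $off  = $node * $NODE + $slot{$ch} * $LONG;
        my $next = unpack 'l!', substr $buf, $off, $LONG;
        unless ($next) {               # no child yet, allocate one
            $next = length($buf) / $NODE;
            $buf .= pack 'l!5', 0, 0, 0, 0, -1;
            substr( $buf, $off, $LONG ) = pack 'l!', $next;
        }
        $node = $next;
    }
    substr( $buf, $node * $NODE + 4 * $LONG, $LONG ) = pack 'l!', $id;
}

add_key( 'ACGT', 0 );
add_key( 'ACCA', 1 );
printf "%d nodes in %d bytes\n", length($buf) / $NODE, length $buf;
```

Note that a child offset of 0 can double as "no child" because nothing ever points back at the root node.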
Anyway, I made an assertion in my original node in this thread, and I've now proved it with _real_ numbers and working code. You claimed it was nonsense; it has proved not to be. All of your wriggling and trying to rewrite the situation to make your claims true doesn't work. I made an assertion and it has proved true: within the parameters where a DFA/Trie is feasible, it will smoke your Xor solution whenever the string being searched is fairly long. Your objections to that claim have proved incorrect, as has your math.
You confidently calculated that it would take years for a solution like this to run, and you have been proved utterly wrong. Even with the current code, run 100 times over, it would take only 525 minutes to search for 1000 words over the 1000 strings of 100k chars. Looking at the run-time difference between 5 words and 10 words on that data set, your solution shows a 1000-second increase, i.e. about 200 seconds per word. Since your operation scales proportionally to the number of words and the length of the string, we can conclude that processing 1000 words on the same data set will take your solution about 3000 minutes. That is roughly 2500 minutes, or about 41 hours, longer than mine. And you want to go this route with 100k words? By the same logic, 100k words would take your solution about 20 million seconds, which is over 230 days, versus the approximately 36 days mine would take. I know which one I would prefer to wait for.
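If you want to check the arithmetic yourself, a back-of-envelope script using only the figures quoted above makes the scaling plain:

```perl
#!/usr/bin/perl
# Back-of-envelope scaling check using the figures above:
# Xor: ~200 s per word over the 1000x100k data set (1000 s / 5 words);
# trie: 5.25 minutes per 10-word batch, run repeatedly.
use strict;
use warnings;

my $xor_sec_per_word = 1000 / 5;    # marginal cost per word, seconds
my $trie_min_per_10  = 5.25;        # one 10-word batch, minutes

for my $words ( 1_000, 100_000 ) {
    my $xor_min  = $xor_sec_per_word * $words / 60;
    my $trie_min = $trie_min_per_10 * $words / 10;
    printf "%6d words: Xor ~%.0f min (%.1f days), trie ~%.0f min (%.1f days)\n",
        $words, $xor_min, $xor_min / 1440, $trie_min, $trie_min / 1440;
}
```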
To make this picture even worse for your solution, I'd like to point out that in my earlier code I was effectively doing $Fuzzy=5 by accident, which added tens of thousands of words to the search list, and it was still outpacing your solution. If that is possible with a pure Perl solution, imagine what a String/Struct/Inline::C/Perl hybrid would do. Using a small fraction of the RAM and doing the search operations in C, my solution only gets faster.
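For instance, the inner scan loop could be pushed into C along these lines. This is a hypothetical Inline::C sketch against the packed layout above, exact matching only; real code would carry the fuzz-count state as well:

```perl
use strict;
use warnings;
use Inline C => <<'END_C';
/* Hypothetical sketch only: walk the packed trie (5 native longs per
   node: A/C/G/T child offsets plus a key id, -1 = none) over the
   string, restarting at every offset.  Returns the first key id hit,
   or -1.  Fuzzy matching would need extra state per position. */
int scan(SV *nodes_sv, SV *str_sv) {
    long  *nodes = (long *)SvPV_nolen(nodes_sv);
    STRLEN len;
    char  *str   = SvPV(str_sv, len);
    STRLEN i, j;
    for (i = 0; i < len; i++) {
        long node = 0;
        for (j = i; j < len; j++) {
            int slot;
            switch (str[j]) {
                case 'A': slot = 0; break;
                case 'C': slot = 1; break;
                case 'G': slot = 2; break;
                case 'T': slot = 3; break;
                default:  slot = -1;
            }
            if (slot < 0) break;
            node = nodes[node * 5 + slot];
            if (node == 0) break;
            if (nodes[node * 5 + 4] >= 0)   /* a key ends here */
                return (int)nodes[node * 5 + 4];
        }
    }
    return -1;
}
END_C

# e.g., with $buf from the packed-trie sketch above:
# printf "hit key %d\n", scan( $buf, 'TTACGTTT' );
```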
All told, I think it's quite clear that I have proved the point of my original post: a state-machine-based scan will outperform anything posted in this thread. And it has. I know you will come back with more of the same here, but I'm no longer interested. You insulted me and posted a bunch of nonsense math saying I'm all wrong, yet now that I have proved quite conclusively that you were wrong, have you gone and updated any of it to point that out? Have you added any caveats anywhere? Isn't this why you were insulting and abusive to me in the CB? You objected to my asserting that my proposal was more efficient than yours, and to my not being clearer in the caveat I added at the bottom; what about your nodes? They are full of totally nonsensical equations and arguments supposedly proving that my approach just can't work, but hey, it outperforms yours. That's a fact you can't get around, no matter how hard you try.