Why is this table in the root node?

Seems like as good a place as any. I'll move it to a reply if you would prefer, although I don't see why.

Why put the original, superseded Xor in that table and not the original Trie?

I used your interpretation of the result requirements. Since by that interpretation my original _proof of concept_ implementation is not up to scratch, I didn't see any point in including it. I could include whole hosts of solutions that don't meet the requirements, but that wouldn't seem to add much to the debate. Also, I used Xor.pm as the baseline implementation because of some comments you made in the other thread: any proposed solution has to get the same results as Xor.pm, or it's probably broken. That's one very nice advantage of a brute-force algorithm — you can use it to test more sophisticated approaches.
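The baseline cross-check described above can be sketched roughly like this. This is an illustrative helper, not code from the test suite: it assumes each implementation can be called as a coderef taking the text and key list, and returning a sorted list of `[offset, key]` matches.

```perl
use strict;
use warnings;

# Compare a candidate implementation's results against the brute-force
# baseline (e.g. Xor.pm). Any disagreement means the candidate is
# probably broken. The calling convention here is hypothetical.
sub cross_check {
    my ($baseline, $candidate, $text, $keys) = @_;
    my @want = $baseline->($text, $keys);
    my @got  = $candidate->($text, $keys);
    return 0 unless @want == @got;           # must find the same number of matches
    for my $i (0 .. $#want) {
        return 0 unless $want[$i][0] == $got[$i][0]    # same offset
                    and $want[$i][1] eq $got[$i][1];   # same key
    }
    return 1;
}
```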

Why the arbitrary 100 second cutoff?

Because with that time restriction we can see the relative performance of the different implementations without having to wait forever for a test run to complete. What utility would have been added to this comparison by converting those N/A's into large numbers? I'd say not much. Feel free to run the test suite with a larger limit, or without a limit at all; I suspect you'll get extremely bored waiting for the Xor2? variants to complete on the larger test sets. I played around with 60 seconds, and also with larger limits like 200, and it didn't seem to make much difference. 100 seconds seemed a reasonable limit.

Why not show the build time for the DFA solution--which isn't required by any of the other solutions in the table?

The time shown is the total time — search time plus construct time — to process a test file. Run the test suite if you want more precise times; alternatively, I can follow up with some additional data if you feel it's necessary. And Ysth's algorithm has a prepare phase that is under some circumstances faster than the DFA's but under others slower. Including the prepare time in the totals seemed the only fair way to do it, even though it disadvantages Ysth.pm and DFA.pm, as both can reuse their prepare work on later searches. For instance, both objects could be provided with a means to serialize themselves and save most of the construct time. To avoid accusations of being unfair, I didn't allow for such time savings in the comparison. Should I change things to allow for this?
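The self-serialization idea above could look something like the following sketch, which caches the prepared object on disk so later runs skip the construct phase. This is hypothetical — the cache-file name and builder callback are illustrative — and it assumes the prepared object is a plain Perl data structure that Storable can round-trip.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Load a prepared object (e.g. a built DFA) from a cache file if one
# exists; otherwise build it with the supplied callback and cache it
# for subsequent runs.
sub load_or_build {
    my ($cache_file, $builder) = @_;
    return retrieve($cache_file) if -e $cache_file;
    my $obj = $builder->();
    store($obj, $cache_file);
    return $obj;
}
```

With something like this, only the first run of a given key set pays the construct cost; whether that is fair in a benchmark that times single runs is exactly the question raised above.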

What is the procedure for updating the OP to include data from improved, corrected implementations?

You can either reply, or I can arrange to change the ownership of Xor.pm or Xor2.pm to you. Personally, I think a reply might be better, as it'll allow people to see how the algorithm has evolved. But it's up to you.

Update: Doh. You mean the chart. Post your results, and if it seems to make more sense to merge them into my chart than to leave them as posted, then I will. Just tell me what you want to do when you post the results. Also, if you come up with new variants, maybe give them new names so they can be independently tested against the rest.

Where are the memory consumption statistics?

Not aggregated into that chart. (Sorry, I didn't think of it.) When the tests run, each object's memory utilization is printed out. Please just run test_fuzz.pl as appropriate. I'll redo the test run today and change the code to include memory utilization in the output. On my machine, Ysth loses that fight pretty much every time.

Why is the test harness so horribly complicated?

The harness is for doing bulk comparisons of different algorithms. It's also designed to ensure that the first string searched in each file is cross-compared against the results of the other implementations. Not only that, it's designed so that each test file/class combination is executed in its own process space, ensuring that one algorithm can't contaminate the results of the next. A naive Benchmark run, for instance, wouldn't do this. As an example, if DFA runs first and allocates 2MB for itself, and then Xor runs, Xor gets to reuse the memory allocated by DFA, saving itself a chunk of resource-allocation costs. By executing each test run in its own process, we don't have to worry about this problem.
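The per-process isolation described above boils down to forking a fresh child for each test run, so memory allocated by one implementation can never be reused by the next. A minimal sketch (the coderef calling convention is an assumption, not the harness's actual interface):

```perl
use strict;
use warnings;

# Run one test in its own process. The child gets a fresh address
# space, so nothing it allocates can be reused by later test runs
# in the parent. Returns true if the test succeeded.
sub run_isolated {
    my ($test) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exit($test->() ? 0 : 1);   # child: run the test, report via exit status
    }
    waitpid($pid, 0);              # parent: reap the child
    return $? == 0;
}
```

The real harness also has to capture the child's timing and memory output, but the isolation guarantee comes from this fork-per-test structure.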

And IMO the test harness isn't that complicated. Most of the "crap" in there is for printing out nice charts of the results; chop all of that out and test_fuzz.pl drops to half its size. Also, it's not as if anybody directly involved in this showdown is a novice programmer, so they can review, tinker with, and change the test suite as they like. For instance, I was rather hoping you would follow up with patches to make_test.pl so that it could be used to produce the "high density return" test sets you said would favour your algorithm. I could have done it myself, but at a certain point I think it's better to publish and wait for responses than to try to anticipate every possible critique that might come up.

This is inherently a functional interface. All the rest is cruft.

I disagree. In order to meaningfully compare different implementations with different requirements, it's necessary to set a baseline where the most participants can play. Also, IMO it's much more likely that if this code gets released into the wild, it would be as an OO module. For instance, how useful is your proposed interface if you want to use data from a DB? Or generate it computationally? I'd say not at all.

Needless to say, I think the test methodology is flawed. 533 lines of code to do

So would that actually test the algorithms? If I had used that, then the original instance of Xor2 would have seemed to pass testing, when in fact it failed and had a bug. I mean, if we just use Benchmark for running our tests, then my solution would be fuzz_search { return }. I'd like to see you beat that :-) If you want to come up with a different test framework, then do so; please just post it here so we can review it. But if it doesn't actually test the return results, then I don't think I'll consider it much of an improvement.

More importantly, the selective presentation of statistics in the root node of the "shootout" — comparing old, first-cut code, written in an hour and long since improved and superseded, with code hand-crafted in C over 10 days using someone else's brilliant, original idea — has devalued this exercise completely.

I don't see what's selective here. I have presented all of the working results I have, as well as all of the code involved. Please feel free to take as long as you like to come up with something better. And yes, ysth made a brilliant observation, and yes, I was happy to take advantage of it, as it presented the perfect way to integrate my proposal (using a DFA as much as possible) while still meeting your "every possible match at every possible offset" requirement. If presenting a comprehensive summary of four solutions to this problem "devalues" the exercise, then I suspect we have different motives here. My motive was to find an optimal solution to the problem at hand, not to prove that any given proposal of mine was optimal. (Although I will admit to a certain satisfaction that a solution using a DFA appears to win this fight, I would still have published regardless.)

---
demerphq


In reply to Re^2: Algorithm Showdown: Fuzzy Matching by demerphq
in thread Algorithm Showdown: Fuzzy Matching by demerphq
