comment on

The math you do to correct this is the same as I had, but your positioning of it (in the if condition) is not.

I wasn't able to spend much time in analysing the optimal solution to the bug. Sorry.

If you intend comparing my code against other algorithms that rely upon all the keys in any given test having the same length, as your original does, and the version I've seen of ysth's extremely clever Xor/hash index hybrid does, then for a fair comparison, I should supply you with an updated version of mine that also relies upon this assumption.

Feel free to provide an updated version. If I havent started the new thread by the time you are ready to post then do it here. Also, while actually ysths solution and my current solution can both be modified to handle variable length string sets I dont think either of us have the time to actually put the changes into place so I would prefer that we stick to fixed length words. However if you want to post a version supporting variable length words then please treat the size provided to the constructor as the MINIMUM size of any string to be stored and I suppose youll get bonus points or something in the comparison. :-)

If the keys will be invarient length, and the other algorithms make this assumption, then that code need only be executed once, rather 100,000 scalar @keys times, and so should be lifted out of the for my $keys ( @keys )... loop.

I think this could either be done in an overloaded prepare() method if it only needs to be done once and should be done before the searching starts but after all the words to match are loaded. Since this is an OO framework you should be able to work out some form of caching pretty easily.

However, returning the results as a list, rather than by reference, seems a retrograde step as this implies replication of the results array at least once, with all the memory allocation and timecosts that implies, which from my breif purusal of the test harness, will be incorporated into the timings.

Ive changed the test framework a fair bit I think so wait for that. Regarding the list return I would have thought that since each routine would have basically the same overhead that it wouldnt matter that much. Nevertheless I'll try to switch things over if I have time. Howabout we say that fuzz_search should do a wantarray ? @results : \@results so that if I get the time to change the harness and the running implementations i have whatever you might be working on will take advantage of it?

Couldn't you record the index of the key, rather than it's value? Wouldn't either packing or stringifying the 3 records values reduce the overhead of the results array by 2/3rds and make for easy comparison?

I was actually thinking of using packed records as the return, so that each record would be a pack "NNA*",$string but for various reasons I think its not appropriate for these tests. And yes it does make it a lot easier for results comparison. But ive got that in the test harness and not in the object. :-)

You mentioned returning an index instead of a string which i avoided just for simplicty sakes. I think that the only way that would be workable would be to add an accessor method to the object for converting the number to a string. Which was one reason I avoided it. However if you are ok with that provision then I don't see why not. The base class can be updated to just do a lookup into the str_array, and then your solutions dont need an overloaded method for it at all. Doing the same to mine and ysths might be moderately harder but still doable. Ill work it in. (So the list will be triplets of longs) If we decide we want to use packed results for high density tests we can switch to pack "NNN". (BTW I say 'N' because it has the useful property of sorting lexicographically.)

I'm hoping I can get the tme to start the new thread this evening. Till then hopefully you have enough info to slot into what i publish when i do.

---
demerphq

In reply to Re^5: Fuzzy Searching: Optimizing Algorithm ( A few improvements). by demerphq
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.