Update: Please don't forget to note the sentence at the end, which points out that I got muddled and was thinking only of the case where the string is matched at the start. Sorry.

I might approach the problem by first converting my search string into all of the possible other strings that I would consider a fuzzy match, then building a trie to represent that set, and then using the trie to search the dictionaries. (You may want to code the trie in Inline::C.) In my experience this will outperform regexes: trie lookup is pretty well optimal, O(KeyLen), effectively O(1) for fixed-length keys. If I had to do these searches a lot and the size of my alphabet was reasonable, I would look at producing fingerprints of my dictionary, i.e. all words containing the same set of letters (regardless of counts) would have the same fingerprint. Assuming you have a fast way to get the list of words associated with a given fingerprint, you have cut your search space vastly, since a single bitwise test against the fingerprint tells you whether a match is even possible. Even if the alphabet is small (like ACTG) you could use tuples in your fingerprinting algorithm; for instance, assuming you use an integer for the fingerprint, you could assign one bit to each possible combination.
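
Here is a minimal Perl sketch of the fingerprint idea, assuming the ACTG alphabet with one bit per letter; the names (%BIT, fingerprint, candidate_buckets) and the sample dictionary are illustrative only, and the containment test shown is just one way to realize the single bitwise check described above:

<code>
use strict;
use warnings;

# One bit per letter of the (assumed) ACTG alphabet; a word's fingerprint
# is the OR of the bits for the letters it contains, ignoring counts.
my %BIT = ( A => 1, C => 2, G => 4, T => 8 );

sub fingerprint {
    my ($word) = @_;
    my $fp = 0;
    $fp |= $BIT{$_} for split //, uc $word;
    return $fp;
}

# Bucket the dictionary by fingerprint once, up front.
my @dictionary = qw(ACGTACGT CCCCGGGG ATATATAT GGGGTTTT);
my %bucket;
push @{ $bucket{ fingerprint($_) } }, $_ for @dictionary;

# For a candidate (one of the generated fuzzy variants), a cheap bitwise
# test discards buckets whose words lack a letter the candidate uses.
sub candidate_buckets {
    my ($candidate) = @_;
    my $cfp = fingerprint($candidate);
    return grep { ( $cfp & ( 0 + $_ ) ) == $cfp } keys %bucket;
}

for my $fp ( candidate_buckets("ACGTTGCA") ) {
    print "possible matches: @{ $bucket{$fp} }\n";
}
</code>

Only the buckets that survive the bit test need to be compared in full, which is where the big reduction in search space comes from.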

I could probably put together an implementation that worked on a base-10 alphabet (i.e. numeric digits) and searched a 30-million-record dictionary for matches, if you provided a routine to generate the match list. Unfortunately I couldn't show you the code, but the results might tell you whether the techniques are appropriate.

Incidentally, AFAICT the upper bound on the time for this approach would be 25 * N, where N is the number of words in your dictionary (25 being the length of the search key). Assuming you fingerprint, the time will be much lower than that. Thus the algorithm would be O(N), which I'm pretty sure is faster than anything else posted here.

An alternative approach would be to store all of your dictionaries in one great big trie and then reverse the process: look up each fuzzy word in the trie and record the words matched. This would only be suitable if the density of the data stored in the dictionaries was high and/or memory was abundant, but it would have the advantage of being almost instantaneous to search. Assuming that one word had 300 fuzzy equivalents, the structure should be searchable in 300 * 25 timesteps regardless of dictionary size.
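
A rough sketch of that variant, using nested hashes as the trie; add_word, trie_match and fuzzy_variants are made-up names, and for brevity fuzzy_variants only generates single-letter substitutions rather than the full fuzzy set discussed above:

<code>
use strict;
use warnings;

sub add_word {    # insert one dictionary word into the trie
    my ( $trie, $word ) = @_;
    my $node = $trie;
    $node = $node->{$_} ||= {} for split //, $word;
    $node->{-end} = 1;    # mark the end of a complete word
}

sub trie_match {          # exact lookup of one variant: O(KeyLen)
    my ( $trie, $word ) = @_;
    my $node = $trie;
    for my $ch ( split //, $word ) {
        $node = $node->{$ch} or return 0;
    }
    return $node->{-end} ? 1 : 0;
}

# Generate the strings we consider fuzzy matches of the key; here just
# every single-letter substitution over the (assumed) ACTG alphabet.
sub fuzzy_variants {
    my ($key) = @_;
    my %seen = ( $key => 1 );
    for my $i ( 0 .. length($key) - 1 ) {
        for my $ch (qw(A C G T)) {
            my $v = $key;
            substr( $v, $i, 1 ) = $ch;
            $seen{$v} = 1;
        }
    }
    return keys %seen;
}

my $trie = {};
add_word( $trie, $_ ) for qw(ACGT ACCT AGGT TTTT);

for my $v ( fuzzy_variants("ACGT") ) {
    print "$v is in the dictionary\n" if trie_match( $trie, $v );
}
</code>

Each probe costs at most the key length, so the total work is roughly (number of variants) * (key length), independent of the dictionary size, which is where the 300 * 25 figure above comes from.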

Gah, I just realized that the above only applies if you need to match from the front. If you need to match at any position in the dictionary words then you need to go beyond a trie.

---
demerphq


In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by demerphq
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki
