I wouldn't do this in Perl1. I'd build a trie from the dictionary of possible search words (and stop words) and process the string much like the regex engine does, trying to greedily match the longest words in the dictionary as you inch along the string character-by-character, keeping a stack so you can backtrack when "yellowner" tries for "yellow" and fails on "ner" and so should try "yell" so it can match "owner".

Fuzzy matching gets even trickier, but I'd also handle that via backtracking. Off the top of my head, I'd probably have a an acceptable error rate for the current word that starts out at 0 and increases only if backtracking requires it to. And have some upper limit on number of errors both as an absolute number of errors per word (say 3 or 4) and as a fraction of the number of letters in the words (say 15% or 20%).

So, for each letter in your string, you look down the trie one letter until there aren't any longer words having what you've matched so far as a prefix. Each time you move over a letter, you push your state onto a stack so you can backtrack. Your state consists of the word prefix you've built up so far, the matching position in the trie, your position in the string, and how many errors you've tolerated and how many errors you are willing to tolerate for the current word.

When you backtrack a word choice the first time, you simply look for a shorter word. If you have to backtrack after that, then you increment the allowed number of errors. If, going forward again, you don't find a matching letter or can't use the obvious matching letter since that was what you tried the last time, then you (if you are allowed to have more errors for this word) look for an error instead. Common errors would be:

Note how the middle item is a combination of the first and last items. So you'd probably want to try that error first and then move on to the other two types after that.

Alternately, you could preload your dictionary with acceptable approximations to your search words, marked with their error score. If your dictionary of search words is not too large, then this would certainly be the way I would go. An efficient trie can hold quite a large dictionary. Allowing even a few simple errors balloons the number of "words" up very quickly, so you might want to make trade-offs on only allowing a very small number of errors or only allowing more likely errors (such as nearby letters phonetically or nearby on the keyboard or such) so that your dictionary stays small enough to be practical to process.

- tye        

1 Perl is particularly slow at general character-at-a-time state machines -- though a mix of Perl and Inline::C could be considered, especially since demerphq already has trie support in such.


In reply to Re: Splitting strings into words when there are no separators (trie & !perl) by tye
in thread Splitting strings into words when there are no separators by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.