in reply to Splitting strings into words when there are no separators
I wouldn't do this in Perl¹. I'd build a trie from the dictionary of possible search words (and stop words) and process the string much like the regex engine does: greedily matching the longest words in the dictionary as you inch along the string character by character, keeping a stack so you can backtrack when "yellowner" tries "yellow", fails on "ner", and has to fall back to "yell" so it can match "owner".
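For concreteness, here's a minimal Perl sketch of that idea (exact matching only, no fuzziness yet). The trie is nested hashes with a `'$'` key marking word ends, and the stack holds (position, words-so-far) pairs so a failed greedy choice can fall back to a shorter word:

```perl
use strict;
use warnings;

sub build_trie {
    my %trie;
    for my $word (@_) {
        my $node = \%trie;
        $node = $node->{$_} //= {} for split //, $word;
        $node->{'$'} = 1;    # end-of-word marker
    }
    return \%trie;
}

sub split_words {
    my ($trie, $str) = @_;
    my @stack = ( [ 0, [] ] );    # [ position in string, words so far ]
    while (@stack) {
        my ($pos, $words) = @{ pop @stack };
        return @$words if $pos == length $str;

        # Walk down the trie from $pos, noting where dictionary words end.
        my $node = $trie;
        my @ends;
        for my $i ( $pos .. length($str) - 1 ) {
            $node = $node->{ substr $str, $i, 1 } or last;
            push @ends, $i + 1 if $node->{'$'};
        }

        # Push shorter matches first so the longest is popped (tried) first.
        push @stack, [ $_, [ @$words, substr $str, $pos, $_ - $pos ] ]
            for @ends;
    }
    return;    # no segmentation found
}

my $trie = build_trie(qw( yell yellow owner ));
print join( ' ', split_words( $trie, 'yellowner' ) ), "\n";  # yell owner
```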
Fuzzy matching gets even trickier, but I'd also handle that via backtracking. Off the top of my head, I'd probably have an acceptable error rate for the current word that starts out at 0 and increases only if backtracking requires it to. And I'd put an upper limit on the number of errors, both as an absolute number of errors per word (say, 3 or 4) and as a fraction of the number of letters in the word (say, 15% or 20%).
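As a sketch, the two limits combine with a min(); the 3 and 20% figures below are just the numbers suggested above (`List::Util` is core Perl):

```perl
use List::Util qw(min);

# Cap on total errors for one word: the absolute limit or the
# fractional limit, whichever is smaller.
sub max_errors_for {
    my ($word) = @_;
    return min( 3, int( 0.20 * length $word ) );
}
```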
So, for each letter in your string, you look down the trie one letter at a time until there aren't any longer words having what you've matched so far as a prefix. Each time you move over a letter, you push your state onto a stack so you can backtrack. Your state consists of the word prefix you've built up so far, the matching position in the trie, your position in the string, how many errors you've tolerated so far, and how many errors you are willing to tolerate for the current word.
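That state might be a record like this (the field names are mine, purely for illustration):

```perl
my @stack;
my $node = {};    # stand-in for the current trie node

# Hypothetical shape of one backtracking state; push a copy at each step.
my $state = {
    prefix     => 'yell',   # word prefix built up so far
    node       => $node,    # matching position in the trie
    pos        => 4,        # position in the string
    errors     => 0,        # errors tolerated so far in the current word
    max_errors => 0,        # errors currently allowed for the current word
};
push @stack, { %$state };   # shallow copy so later updates don't alias
```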
When you backtrack a word choice the first time, you simply look for a shorter word. If you have to backtrack after that, you increment the allowed number of errors. If, going forward again, you don't find a matching letter, or can't use the obvious matching letter because that's what you tried last time, then (if you are allowed more errors for this word) you look for an error instead. Common errors would be:

- an extra letter in the string: advance one position in the string without moving in the trie
- a wrong letter: advance one position in the string while following a non-matching branch of the trie
- a missing letter in the string: follow a branch of the trie without advancing in the string
Note how the middle item is a combination of the first and last items. So you'd probably want to try that error first and then move on to the other two types after that.
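Here's a sketch of generating those three moves in that priority order, assuming the hypothetical `$state` record shown earlier (`advance` is a made-up helper that follows one trie branch):

```perl
use strict;
use warnings;

sub error_moves {
    my ($state, $str) = @_;
    my $c        = substr $str, $state->{pos}, 1;
    my @branches = grep { $_ ne '$' } keys %{ $state->{node} };
    my @moves;

    # 1. Wrong letter: advance in the string and down a non-matching branch.
    push @moves, advance( $state, $_, 1 ) for grep { $_ ne $c } @branches;

    # 2. Missing letter: advance in the trie without consuming the string.
    push @moves, advance( $state, $_, 0 ) for @branches;

    # 3. Extra letter: consume the string character, stay put in the trie.
    push @moves, { %$state, pos => $state->{pos} + 1,
                   errors => $state->{errors} + 1 };

    return grep { $_->{errors} <= $state->{max_errors} } @moves;
}

# Hypothetical helper: follow one trie branch, optionally consuming a char.
sub advance {
    my ($state, $letter, $consume) = @_;
    return { %$state,
        prefix => $state->{prefix} . $letter,
        node   => $state->{node}{$letter},
        pos    => $state->{pos} + $consume,
        errors => $state->{errors} + 1,
    };
}
```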
Alternatively, you could preload your dictionary with acceptable approximations of your search words, each marked with its error score. If your dictionary of search words is not too large, then this would certainly be the way I'd go. An efficient trie can hold quite a large dictionary. But allowing even a few simple errors balloons the number of "words" very quickly, so you might have to make trade-offs -- allowing only a very small number of errors, or only the more likely errors (such as letters that are phonetically close or adjacent on the keyboard) -- so that your dictionary stays small enough to be practical to process.
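If going that route, a sketch like this (scoring every variant at edit distance 1 with error score 1) shows how quickly the dictionary balloons:

```perl
use strict;
use warnings;

my @alphabet = ( 'a' .. 'z' );

# Map a word and its one-error approximations to error scores.
sub variants {
    my ($word) = @_;
    my %seen = ( $word => 0 );    # the exact word scores 0
    for my $i ( 0 .. length($word) - 1 ) {
        my $del = $word;
        substr( $del, $i, 1, '' );        # missing letter
        $seen{$del} //= 1;
        for my $c (@alphabet) {
            my $sub = $word;
            substr( $sub, $i, 1, $c );    # wrong letter
            $seen{$sub} //= 1;
            my $ins = $word;
            substr( $ins, $i, 0, $c );    # extra letter
            $seen{$ins} //= 1;
        }
    }
    $seen{ $word . $_ } //= 1 for @alphabet;    # extra letter at the end
    return %seen;
}

my %scored = variants('yellow');
printf "%d entries for 'yellow' at <= 1 error\n", scalar keys %scored;
```

One six-letter word already produces a few hundred entries, so multiply that by your word list before committing to this approach.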
- tye
¹ Perl is particularly slow at general character-at-a-time state machines -- though a mix of Perl and Inline::C could be considered, especially since demerphq already has trie support in just such a mix.