Hello monks. I'm looking for some quasi-semantic wisdom on the following problem. I'd like to compare sentences to each other to see if they're the same, but making allowances for added/missing words or typos.
This maybe isn't a Perl question in the strictest sense, but I'll be using Perl to do it. I've considered various forms of diff, including WordDiff, which is nice but not quite what I'm after. The algorithm I'm considering now does spot checks of substrings at random indices. That raises hard-to-answer questions about what constitutes an acceptable margin of error, and I'm not sure it will work very well in the wild.
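For comparison with the random spot-check idea, here is a minimal sketch of a different technique: a whole-string similarity ratio. It is written in Python only for brevity (the same idea ports to Perl), and it uses the standard library's `difflib.SequenceMatcher`. The names `normalize` and `similarity` are made up for this illustration, and the normalization (lowercasing, collapsing whitespace) is an assumption about what should count as "the same".

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse runs of whitespace (an assumed notion of 'same')."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


template = "the quick brown fox jumps over the lazy dog"
with_typo = "the quikc brown fox jumps over lazy dog"  # one typo, one missing word
unrelated = "completely different words about nothing in particular"

# A near copy should score well above an unrelated sentence.
print(similarity(template, with_typo))
print(similarity(template, unrelated))
```

This does not remove the margin-of-error question; it just moves it into a single threshold on the score, which would have to be tuned on real samples.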
The purpose is to get incoming text streams and compare them to a template to determine if the person is using the template or deviating from the template. In this application, people will be allowed and even encouraged to deviate from the template they're given to write, but I want to be able to determine when that's happening in real time.
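A rough sketch of how the real-time part might work, assuming the text arrives as a growing prefix of what the person has typed so far: compare it only against the template prefix of the same length. This is Python for brevity, and `tracks_template` and the 0.8 threshold are hypothetical choices for illustration, not tested values.

```python
from difflib import SequenceMatcher


def _clean(text):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def tracks_template(partial, template, threshold=0.8):
    """Return True if the text typed so far still resembles the template.

    Only the template prefix of the same length as the input is compared,
    so an unfinished copy is not penalized for the words not yet typed.
    The threshold is an assumption and would need tuning on real data.
    """
    typed = _clean(partial)
    expected = _clean(template)[: len(typed)]
    return SequenceMatcher(None, typed, expected).ratio() >= threshold


template = "the quick brown fox jumps over the lazy dog"
print(tracks_template("the quikc brown", template))  # near copy with a typo
print(tracks_template("my dog ate home", template))  # clearly something else
```

One caveat with this design: a word missing early in the input shifts everything after it, so the fixed-length prefix lines up less well as more words are dropped. Comparing against a slightly longer template prefix might help, but that is untested.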
One thing that should make the problem easier is that users should be either attempting to copy the template or clearly doing something else. The two behaviors should be quite clearly distinct and, to the eye, would be easily distinguishable. However, a human reader can judge the meaning of the sentence being evaluated and I think that's actually the first line of analysis that informs the rest (such as noticing typos).
Any general thoughts on algorithms to approach this problem with will be appreciated. Thank you very much.
In reply to comparing sentences by cntrtrst