Hello monks. I'm looking for some quasi-semantic wisdom on the following problem. I'd like to compare sentences to see whether they're the same, while making allowances for added or missing words and typos.

This maybe isn't a Perl question in the strictest sense, but I'll be using Perl to do it. I've considered various forms of diff, including WordDiff, which is nice but not quite what I'm after. The algorithm I'm considering now does spot checks of substrings at random indices, but that raises hard-to-answer questions about what constitutes an acceptable margin of error, and I'm not sure it will work very well in the wild.
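To make the spot-check idea concrete, here is a minimal sketch of what I have in mind. The name spot_check_score and the len and samples options are placeholders I made up for illustration, and their defaults are guesses rather than tuned values.

```perl
use strict;
use warnings;

# Hypothetical helper: the fraction of randomly chosen template substrings
# that also appear verbatim somewhere in the input.
sub spot_check_score {
    my ($template, $input, %opt) = @_;
    my $len     = $opt{len}     // 5;     # probe length (assumed default)
    my $samples = $opt{samples} // 50;    # number of probes (assumed default)

    # Normalise case and whitespace so trivial differences don't count.
    for ($template, $input) {
        $_ = lc $_;
        s/\s+/ /g;
        s/^ | $//g;
    }

    # A template shorter than one probe cannot be sampled.
    return 0 if length($template) < $len;

    my $hits = 0;
    for (1 .. $samples) {
        my $pos   = int rand(length($template) - $len + 1);
        my $probe = substr $template, $pos, $len;
        $hits++ if index($input, $probe) >= 0;
    }
    return $hits / $samples;
}
```

Any probe that happens to overlap a typo or a missing word will miss, so the score drops roughly in proportion to how much of the template was altered. Where to put the cutoff is exactly the margin-of-error question I can't answer yet.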

The purpose is to take incoming text streams and compare them to a template to determine whether the person is following the template or deviating from it. In this application, people will be allowed and even encouraged to deviate from the template they're given, but I want to be able to detect when that's happening in real time.

One thing that should make the problem easier is that users should be either attempting to copy the template or clearly doing something else. The two behaviors should be quite distinct and, to the eye, easily distinguishable. However, a human reader can judge the meaning of the sentence being evaluated, and I think that's actually the first line of analysis that informs the rest (such as noticing typos).
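Because the two behaviors should be so distinct, one alternative I could imagine (not something I've settled on) is a crude word-level score with a generous threshold. The sketch below uses plain Levenshtein distance to forgive typos. The names edit_distance, words and template_similarity, and the max_typos default of 1, are all hypothetical choices for illustration, and word order is deliberately ignored.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein edit distance between two strings (two-row version).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Lower-case a sentence and split it into words, dropping punctuation.
sub words {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9']+/, lc $text;
}

# Hypothetical score: the fraction of template words that have a near match
# (edit distance <= $max_typos) among the input words not yet used.
sub template_similarity {
    my ($template, $input, $max_typos) = @_;
    $max_typos //= 1;    # assumed tolerance, not a tuned value
    my @want = words($template);
    my @have = words($input);
    return 0 unless @want;

    my $matched = 0;
    for my $w (@want) {
        for my $k (0 .. $#have) {
            next if edit_distance($w, $have[$k]) > $max_typos;
            splice @have, $k, 1;    # each input word may match only once
            $matched++;
            last;
        }
    }
    return $matched / @want;
}
```

A score near 1 would mean the template is being followed and a score near 0 would mean the writer has gone elsewhere. Very short words can match each other by accident at a tolerance of 1, so this is only a starting point, and it says nothing about meaning.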

Any general thoughts on algorithms for approaching this problem would be appreciated. Thank you very much.

In reply to comparing sentences by cntrtrst
