Lots of good stuff. Thank you all for your inputs.
A couple of you asked for samples. There’s nothing unusual about the texts that will be used. I plan to test with texts from Wikipedia by introducing misspellings, deletions, additions and changes in punctuation. Of course that still raises the question: how much can you alter a sentence before it becomes something else? Maybe I should be asking a different kind of monk about that. :)
Nevertheless I’ve included some texts below just to give a broad sense of what I expect to see. These are all from: https://en.wikipedia.org/wiki/Human_rights.
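For what it’s worth, here’s roughly how I picture generating the mangled test copies from those texts - just a sketch, with an arbitrary edit count and character pool, nothing final:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a test-case generator: take a clean sentence and return a
    # copy with a few random edits of the kinds mentioned above
    # (misspellings, deletions, insertions, punctuation changes).
    sub mangle {
        my ($sentence, $edits) = @_;
        $edits //= 3;
        my @chars = split //, $sentence;
        my @pool  = ('a' .. 'z', ',', '.', ';', ' ');
        for (1 .. $edits) {
            my $pos = int rand @chars;
            my $op  = int rand 3;
            if    ($op == 0) { $chars[$pos] = $pool[int rand @pool] }           # substitution
            elsif ($op == 1) { splice @chars, $pos, 1 }                         # deletion
            else             { splice @chars, $pos, 0, $pool[int rand @pool] }  # insertion
        }
        return join '', @chars;
    }

    my $original = "All human beings are born free and equal in dignity and rights.";
    print mangle($original, 4), "\n";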
There’s flexibility on how far into the text the algorithm has to read before making a determination. Sentence by sentence is probably a reasonable first approximation.
Certainly Levenshtein distance looks worthy of study, and String::Approx looks very interesting as well, along with a few more suggestions made in the String::Approx documentation on CPAN. I’ll have to experiment with all of this and see where it gets me. And I have to beg your pardon - it could take a while before I can comment further on these suggestions.
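In the meantime, the first experiment will probably be something as blunt as a normalized edit distance per sentence, roughly like this (using Text::Levenshtein here just because it’s a straightforward implementation; the 0.2 threshold is a pure guess that will need tuning):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::Levenshtein qw(distance);
    use List::Util qw(max);

    # Crude sentence-by-sentence test: treat two sentences as "the same"
    # when the edit distance is a small fraction of the longer one's length.
    sub same_sentence {
        my ($s1, $s2, $threshold) = @_;
        $threshold //= 0.2;
        my $dist = distance($s1, $s2);
        my $len  = max(length $s1, length $s2) || 1;
        return ($dist / $len) <= $threshold;
    }

    my $orig    = "Human rights are moral principles or norms for certain standards of human behaviour.";
    my $mangled = "Humann rights are moral principls or norms, for certain standards of human behavior.";

    my $d = distance($orig, $mangled);
    print "distance: $d, same sentence? ",
          same_sentence($orig, $mangled) ? "yes" : "no", "\n";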
> And if you want to go hardcore on the problem: “wordnet”
It’s not so far-fetched. At the very least, some effort to do grammatical parsing or look at sentence structure could be helpful. I’ve had good experiences with Lingua::LinkParser, and it can be a way to look at the abstraction of the sentence instead of at the sentence itself, though it's probably too much overhead for this application.
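For completeness, this is roughly what that route looks like - I’m quoting the usage from memory, so check the module’s docs for the exact method names - the idea being to compare the parse diagrams rather than the raw strings:

    use strict;
    use warnings;
    use Lingua::LinkParser;

    # From memory of the Lingua::LinkParser synopsis; method names may be
    # slightly off. The point is to compare parse structure (the abstraction)
    # rather than the literal sentence text.
    my $parser   = Lingua::LinkParser->new;
    my $sentence = $parser->create_sentence("Human rights are moral principles or norms.");

    foreach my $linkage ($sentence->linkages) {
        print $parser->get_diagram($linkage);
    }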