I'm trying to take user-submitted material (quotes, snippets) associated with a URI and then validate that the material really is to be found at the given src-URI. This is ultimately an impossible problem to get perfect without editor intervention, but I'm trying to see how close I can get before bringing in the "editor" (much like spam filtering).
My code stub works for basic stuff. Pseudo-code: reject obviously malicious/bogus stuff outright (has links or XSS); fetch the quote's source (LWPx::ParanoidAgent); normalize both the quote and the web content: strip HTML, normalize the encoding, lower-case, strip special characters and spacing (because the stripped HTML can have an effect on those). I want to get a little more sophisticated, however. I want the user to be able to submit a mildly edited quote.
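For concreteness, a minimal sketch of that basic pipeline (the normalize helper, its exact stripping rules, and the @ARGV inputs are illustrative, not my real stub):

    use strict;
    use warnings;
    use LWPx::ParanoidAgent;
    use HTML::Strip;

    # Illustrative normalizer: strip HTML, lower-case, drop special
    # characters, and collapse whitespace so both sides compare alike.
    sub normalize {
        my ($text) = @_;
        my $hs = HTML::Strip->new;
        $text = $hs->parse($text);
        $hs->eof;
        $text = lc $text;
        $text =~ s/[^a-z0-9\s]//g;  # strip special chars
        $text =~ s/\s+/ /g;         # collapse runs of whitespace
        $text =~ s/^ | $//g;        # trim
        return $text;
    }

    my ($src_uri, $submitted_quote) = @ARGV;  # hypothetical inputs

    my $ua  = LWPx::ParanoidAgent->new;
    my $res = $ua->get($src_uri);
    die "fetch failed: " . $res->status_line unless $res->is_success;

    my $page  = normalize($res->decoded_content);
    my $quote = normalize($submitted_quote);

    print "exact match\n" if index($page, $quote) >= 0;

decoded_content handles the encoding normalization, and HTML::Strip deals with the tags before the character-level cleanup.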
This might be the breakdown point. I'd love to allow things like pronoun switching for proper nouns in brackets, like editors often do (I know this is all but impossible, but some fuzzy matching might allow for it without guaranteeing it), and short snips/elisions with an ellipsis. For the second I could imagine doing something like =~ s/\.\.\./[[:punct:][:alpha:]\s]{5,30}?/, so the converted regex would allow for some snippage.
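A sketch of that conversion, assuming the quote is quotemeta'd first so only the ellipses become wildcards (the 5-30 character window is just a guess at reasonable snippage):

    use strict;
    use warnings;

    # Illustrative: turn a quote containing "..." elisions into a regex
    # where each ellipsis tolerates 5-30 characters of snipped text.
    sub quote_to_pattern {
        my ($quote) = @_;
        my $pat = quotemeta $quote;  # escape everything literally first
        # quotemeta turned "..." into "\.\.\."; swap in a bounded wildcard
        $pat =~ s/\\\.\\\.\\\./[[:punct:][:alpha:]\\s]{5,30}?/g;
        return qr/$pat/;
    }

    # usage (strings hypothetical):
    my $quote = 'it was the best of times ... it was the age of wisdom';
    my $page  = 'it was the best of times it was the worst of times '
              . 'it was the age of wisdom it was the age of foolishness';
    print "found with snippage\n" if $page =~ quote_to_pattern($quote);

One wrinkle: this has to happen before (or instead of) the special-character stripping above, since that step would eat the ellipsis itself.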
I've got String::Approx and Text::LevenshteinXS available to maybe let some fudginess in, but I'm having trouble thinking it through because the target (the quote) might be 50 chars while the src (the original page) might be 200k. Maybe use String::Approx to find the match area and then Text::LevenshteinXS to see if its fuzziness (edit distance) is within configured limits?
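Something like this two-stage sketch, assuming String::Approx's aindex behaves like index() with approximateness and Text::LevenshteinXS::distance does the scoring (the 10% modifier and the window slack are arbitrary knobs):

    use strict;
    use warnings;
    use String::Approx qw(aindex);
    use Text::LevenshteinXS qw(distance);

    # Illustrative two-stage check: locate the rough match position
    # first, then score only a quote-sized window with Levenshtein.
    sub fuzzy_found {
        my ($quote, $page, $max_ratio) = @_;  # $max_ratio, e.g. 0.2
        my $len    = length $quote;
        my $budget = int($len * $max_ratio);  # edits we'll tolerate

        # Stage 1: approximate index of the quote in the page,
        # allowing ~10% differences (arbitrary knob).
        my $idx = aindex($quote, ['10%'], $page);
        return 0 if $idx < 0;

        # Stage 2: cut a window slightly wider than the quote around
        # that index and measure the edit distance against it.
        my $slack = $budget + 5;
        my $start = $idx - $slack;
        $start = 0 if $start < 0;
        my $window = substr $page, $start, $len + 2 * $slack;

        # The window is up to 2*$slack chars longer than the quote, so
        # allow that much distance on top of the edit budget itself.
        return distance($quote, $window) <= $budget + 2 * $slack;
    }

    # usage (values hypothetical):
    # print "ok\n" if fuzzy_found($quote, $page, 0.2);

The point of the two stages: the raw Levenshtein distance between a 50-char quote and a 200k page is dominated by the page's length, so the approximate search has to shrink the comparison down to a quote-sized window first.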
I can get by with my basics, perhaps allowing the snips and flagging anything that fails for editors, but I'm very curious what the algorithm/design tailors here might suggest instead or in addition.
Thanks!
(Update: a couple of grammar and clarity edits.)