Your Mother has asked for the wisdom of the Perl Monks concerning the following question:
I'm trying to take user-submitted material (quotes, snippets) associated with a URI and then validate that the material really is to be found at the given source URI. This is ultimately an impossible problem domain that needs editor intervention to be perfect, but I'm trying to see how close I can get before bringing in the "editor" (as with spam filtering).
My code stub works for basic stuff. Pseudo-code: reject obviously malicious/bogus stuff outright (has links or XSS), fetch the quote source (LWPx::ParanoidAgent), then normalize both the quote and the web content: strip HTML, normalize encoding, lower case, strip special chars and spacing (because the stripped HTML can have an effect on that). I want to get a little more sophisticated, however. I want the user to be able to submit a mildly edited quote.
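A minimal sketch of the normalize-and-compare step described above, using only core Perl. The helper names `normalize` and `quote_found` are made up for illustration, the tag strip is a crude regex rather than a real HTML parser, entities are not decoded, and only ASCII letters and digits survive normalization, so treat it as a starting point rather than the actual stub.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce text to lowercase ASCII letters and digits,
# so markup, punctuation, and spacing differences cannot break the match.
sub normalize {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;     # crude tag strip; a real HTML parser is safer
    $text = lc $text;
    $text =~ s/[^a-z0-9]+//g;   # drop punctuation, whitespace, everything else
    return $text;
}

# Hypothetical helper: true if the normalized quote occurs in the normalized page.
sub quote_found {
    my ($quote, $page) = @_;
    my $needle = normalize($quote);
    return 0 unless length $needle;   # an empty needle would match anything
    return index(normalize($page), $needle) >= 0 ? 1 : 0;
}

my $page = '<p>It was the <em>best</em> of times, it was the worst of times.</p>';
print quote_found('It was the best of times', $page) ? "found\n" : "not found\n";
```

Because inline tags such as `<em>` are removed before comparing, a quote that spans markup still matches. The price is that stripping everything but letters and digits also erases word boundaries, so very short quotes can match by accident.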
This might be the breakdown point. I'd love to allow things like swapping pronouns for proper nouns in brackets, as editors often do (I know this is all but impossible, but some fuzzy matching might allow for it without guaranteeing it), and short snips/elisions marked with an ellipsis. The second I could imagine handling with something like =~ s/\.\.\./[[:punct:][:alpha:]\s]{5,30}?/. So the converted regex would allow for some snippage.
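A sketch of the ellipsis idea, under a few assumptions: the function name `quote_to_regex` is hypothetical, the quote and the page text both have whitespace already collapsed to single spaces, and the 5 to 30 character gap is just the guess from the substitution above. Rather than substituting inside the raw quote, it splits on the ellipsis and runs each literal piece through `quotemeta`, since otherwise any regex metacharacter in the quote would be interpreted as part of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a quote containing "..." elisions into a regex.
# Literal pieces are escaped; each elision becomes a bounded, non-greedy gap.
sub quote_to_regex {
    my ($quote, $min, $max) = @_;
    $min = 5  unless defined $min;
    $max = 30 unless defined $max;

    # Split on the ellipsis, swallowing the spaces around it, and
    # discard empty pieces left by a leading ellipsis.
    my @pieces = grep { length } split /\s*\.\.\.\s*/, $quote;
    return unless @pieces;

    # Same character class as in the post; note that it excludes digits.
    my $gap  = '[[:punct:][:alpha:]\s]{' . $min . ',' . $max . '}?';
    my $body = join $gap, map { quotemeta($_) } @pieces;
    return qr/$body/i;
}

my $source = 'We hold these truths to be self-evident, that all men are created equal';
my $re     = quote_to_regex('We hold these truths ... that all men are created equal');
print $source =~ $re ? "accepted\n" : "rejected\n";
```

The upper bound is what keeps this from being a blank check: a quote that skips more than 30 characters at a single ellipsis fails to match and can be flagged for an editor.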
I've got String::Approx and Text::LevenshteinXS available to maybe let some fudginess in, but I'm having trouble thinking it through because the target (the quote) might be 50 chars while the source (the original page) might be 200k. Maybe use String::Approx to catch the match area and then Text::LevenshteinXS to see if its fuzziness (difference) is within configured limits?
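To make the short-quote-versus-long-page problem concrete, here is a pure-Perl sketch of approximate substring search. It is a stand-in for the two-module idea, not what String::Approx or Text::LevenshteinXS do internally: it is the usual Levenshtein dynamic program with the first row pinned to zero so that a match may begin anywhere in the page. The name `best_fuzzy_match` and the 10% tolerance in the example are assumptions for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: find the substring of $text closest to $pattern.
# Returns (distance, end), where end is the offset just past the first
# position at which the best distance is reached.
sub best_fuzzy_match {
    my ($pattern, $text) = @_;
    my @p   = split //, $pattern;
    my $m   = @p;
    my @col = (0 .. $m);              # pattern prefixes vs. an empty match
    my ($best, $best_end) = ($m, 0);
    my $pos = 0;

    for my $ch (split //, $text) {
        $pos++;
        my $diag = $col[0];           # $col[0] stays 0: a match may start anywhere
        for my $i (1 .. $m) {
            my $old  = $col[$i];
            my $cost = $p[$i - 1] eq $ch ? 0 : 1;
            $col[$i] = min($diag + $cost,       # match or substitute
                           $old + 1,            # extra char in the text
                           $col[$i - 1] + 1);   # char missing from the text
            $diag = $old;
        }
        ($best, $best_end) = ($col[$m], $pos) if $col[$m] < $best;
    }
    return ($best, $best_end);
}

my $quote = 'the quick brown fox';
my $page  = 'yesterday the quikc brown fox jumped over the lazy dog';
my ($dist, $end) = best_fuzzy_match($quote, $page);
my $limit = int(0.1 * length($quote)) || 1;   # assumed tolerance: about 10%
print "distance $dist, ", ($dist <= $limit ? "accept" : "send to editor"), "\n";
```

The work is proportional to quote length times page length, so a 50-character quote against a 200k page is around ten million inner steps. That is tolerable for a one-off check in pure Perl, and it is the kind of workload where the XS modules should earn their keep.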
I can get by with my basics, perhaps trying to allow the snips and then flagging things that fail for editors, but I'm very curious what the algorithm/design tailors here might suggest instead or in addition.
Thanks!
(update: couple grammar, clarity edits).
Replies are listed 'Best First'.

Re: Verifying a quote matches (closely enough) a source URI
by kyle (Abbot) on Feb 02, 2008 at 04:43 UTC

Re: Verifying a quote matches (closely enough) a source URI
by BrowserUk (Patriarch) on Feb 02, 2008 at 05:37 UTC
by Your Mother (Archbishop) on Feb 02, 2008 at 20:41 UTC

Re: Verifying a quote matches (closely enough) a source URI
by ww (Archbishop) on Feb 02, 2008 at 13:51 UTC

Re: Verifying a quote matches (closely enough) a source URI
by SuicideJunkie (Vicar) on Feb 03, 2008 at 02:33 UTC