I'm trying to take user-submitted material (quotes, snippets) associated with a URI and then validate that the material really is to be found at the given source URI. This is ultimately an impossible problem domain, needing editor intervention to be perfect, but I'm trying to see how close I can get before bringing in the "editor" (like spam filtering).

My code stub works for basic stuff. Pseudo-code: reject obviously malicious/bogus stuff outright (has links or XSS), fetch the quote source (LWPx::ParanoidAgent), then normalize both the quote and the web content: strip HTML, normalize encoding, lower-case, strip special chars and spacing (b/c the stripped HTML can have an effect on that). I want to get a little more sophisticated, however: I want the user to be able to submit a mildly edited quote.
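
The normalize-then-compare step above can be sketched roughly as follows. This is a minimal illustration, not the actual stub: normalize_text is a hypothetical helper name, the tag stripping is a crude regex (a real HTML parser would be safer), entities are not decoded, and the input is assumed to be already-decoded character data.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce text to a bare canonical form so that
# markup, case, punctuation and spacing differences do not matter.
sub normalize_text {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;        # crude tag strip (assumption: no parser available)
    $text = lc $text;             # lower-case
    $text =~ s/[^[:alnum:]]+//g;  # drop special chars AND all spacing
    return $text;
}

my $page  = '<p>The <b>quick</b> brown fox, they say, "jumps"!</p>';
my $quote = 'The quick brown fox, they say, "jumps"!';

# An exact substring test on the normalized forms.
print index(normalize_text($page), normalize_text($quote)) >= 0
    ? "found\n"
    : "not found\n";
```

Dropping all whitespace avoids false negatives where removed tags change the spacing, at the price of also matching across word boundaries.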

This might be the breakdown point. I'd love to allow things like switching pronouns for proper nouns in brackets, like editors often do (I know this is all but impossible, but some fuzzy matching might allow for it without guaranteeing it), and short snips/elisions marked with an ellipsis. The second I could imagine doing with something like =~ s/\.\.\./[[:punct:][:alpha:]\s]{5,30}?/, so that the converted regex would allow for some snippage.
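
One way the ellipsis idea could look, as a sketch only: quote_to_regex is a hypothetical helper, the 5 to 30 character gap and the character class are taken from the substitution above, and the match runs against un-normalized text to keep the example readable. Each literal piece of the quote is escaped with quotemeta so that only the ellipsis becomes a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex from a quote in which every "..."
# (or Unicode ellipsis) may stand for a bounded run of omitted text.
sub quote_to_regex {
    my ($quote, $min, $max) = @_;
    $min //= 5;     # shortest snip allowed (assumed default)
    $max //= 30;    # longest snip allowed (assumed default)

    # Split on the ellipsis, escape the literal parts, discard empties.
    my @parts = map  { quotemeta($_) }
                grep { length }
                split /\s*(?:\.{3}|\x{2026})\s*/, $quote;

    # The gap pattern: non-greedy, same character class as in the post.
    my $gap     = sprintf '[[:punct:][:alpha:]\s]{%d,%d}?', $min, $max;
    my $pattern = join $gap, @parts;
    return qr/$pattern/i;
}

my $source = 'We hold these truths to be self-evident, that all men are created equal.';
my $re     = quote_to_regex('We hold these truths ... that all men are created equal');

print "quote accepted\n" if $source =~ $re;
```

Note the character class has no digits, so a snipped passage containing numbers would fail; whether that is wanted is a policy choice.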

I've got String::Approx and Text::LevenshteinXS available to maybe let some fudginess in, but I'm having trouble thinking it through because the target (the quote) might be 50 chars while the source (the original page) might be 200k. Maybe use String::Approx to catch the match area and then Text::LevenshteinXS to see if its fuzziness (difference) is within configured limits?
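
To make the size mismatch concrete, here is a pure-Perl sketch of approximate substring matching: the usual Levenshtein table, except that a match may start anywhere in the source at no cost, so the result is the edit distance between the quote and its best-fitting stretch of the page. It is a stand-in for illustration, not the String::Approx or Text::LevenshteinXS API; best_fuzzy_match and the 0.15 limit are hypothetical, and the loop costs roughly quote length times source length, which is slow in pure Perl for a 200k page.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: smallest edit distance between $needle and ANY
# substring of $haystack, plus the offset at which that best match ends.
sub best_fuzzy_match {
    my ($needle, $haystack) = @_;
    my @n = split //, $needle;
    my $m = @n;
    my @prev = (0 .. $m);              # needle prefix vs. empty text
    my ($best, $best_end) = ($m, 0);
    my $pos = 0;

    for my $h (split //, $haystack) {
        $pos++;
        my @cur = (0);                 # a match may begin at any position for free
        for my $i (1 .. $m) {
            my $cost = $n[$i - 1] eq $h ? 0 : 1;
            $cur[$i] = min(
                $prev[$i] + 1,         # extra char in the source
                $cur[$i - 1] + 1,      # char missing from the source
                $prev[$i - 1] + $cost  # same char or substitution
            );
        }
        ($best, $best_end) = ($cur[$m], $pos) if $cur[$m] < $best;
        @prev = @cur;
    }
    return ($best, $best_end);
}

my $quote  = 'brown fox jumped';
my $source = 'the quick brown fox jumps over the lazy dog';
my ($dist, $end) = best_fuzzy_match($quote, $source);

my $limit = 0.15;   # hypothetical tolerance: edits allowed per quote character
printf "distance %d: %s\n", $dist,
    ($dist / length($quote) <= $limit ? 'accept' : 'flag for editor');
```

Scaling the limit by the quote length keeps short quotes from passing on a couple of stray edits while still tolerating small changes in longer ones.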

I can get by with my basics, perhaps allowing the snips and then flagging things that fail for editors, but I'm very curious what the algorithm/design tailors here might suggest instead or in addition.

Thanks!

(update: couple grammar, clarity edits).

In reply to Verifying a quote matches (closely enough) a source URI by Your Mother
