Your Mother has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to take user submitted material (quotes, snippets) associated with a URI and then validate the material really is to be found on the given src-URI. This is ultimately an impossible problem domain needing editor intervention to be perfect, but I'm trying to see how close I can get before bringing in the "editor" (like spam filtering).

My code stub works for basic stuff. Psuedo-code: reject obviously malicious/bogus stuff outright (has links or XSS), fetch quote source (LWPx::ParanoidAgent), normalize quote and web content: strip HTML, normalize coding, lower case, strip special chars and spacing (b/c the stripped HTML can have an effect on that). I want to get a little more sophisticated, however. I want the user to be able to submit a mildly edited quote.

This might be the breakdown point. I'd love to allow things like pronoun switching for proper nouns in brackets, like editors often do (I know this is all but impossible but some fuzzy matching might allow for it without guaranteeing it). And short snips/elisions with ellipsis. The second I could image doing something like =~ s/\.\.\./[[:punct:][:alpha:]\s]{5,30}?/. So the converted regex would allow for some snippage.

I've got String::Approx and Text::LevenshteinXS available to maybe let some fudginess in but I'm having trouble thinking it through because the target (the quote) might be 50 chars while the src (the original page) might be 200k. Maybe use String::Approx to catch the match area and then Text::LevenshteinXS to see if its fuzziness (difference) is within configured limits?

I can get by with my basics and perhaps trying to allow the snips and then flagging things that fail for editors but I'm very curious what the algorithm/design tailors here might suggest instead or in addition.

Thanks!

(update: couple grammar, clarity edits).

Replies are listed 'Best First'.
Re: Verifying a quote matches (closely enough) a source URI
by kyle (Abbot) on Feb 02, 2008 at 04:43 UTC

    I used to work at a publishing services company editing the text of books. If I was looking at a page with an editor's mark, and I wanted to find that place in the file, I'd look for a nearby unusual word or phrase and search for that in the file.

    So my suggestion is to focus on the parts of each text that make them most unique. Take a word histogram of the original page text and look for the least common words in the quote. From there, you can look nearby in the original page and see if other low frequency words are nearby.

    I can't tell from your problem description whether you're more concerned with false positives or false negatives. I think the algorithm I'm describing could be fairly tricky, and probably hard to tune to a particular false detection rate. Still, depending on the nature of the texts you have, it might pick up things other matching methods miss.

Re: Verifying a quote matches (closely enough) a source URI
by BrowserUk (Patriarch) on Feb 02, 2008 at 05:37 UTC

    Update: Added a little more substance to the demonstration.

    String::Approx and Text::Levenshtein may not be as useful as you you might expect for this as they operate on the string as an array of chars.

    It's easy to see how "cunning stunt" and it's rude spoonerism would be rated as closely synonymous by such algorithms.

    Perhaps a better approach, once you've extracted the raw text from the html, would be to search for the individual words from the quote, in the extracted text, and look for a high proportion of matches in close proximity.

    If you stored the matches in a hash, wordPositionInText => word, then look for runs of consecutive, or nearly consecutive positions in the result, then you will find likely candidates easily. For example, match the edited quote "I want to get a little more sophisticated and allow the user to submit a mildly edited quote." against the text from your post:

    #! perl -slw use strict; use List::Util qw[ reduce ]; use Data::Dump qw[pp]; sub normalise { local $_ = lc shift; tr[!"£$%^&*()-_=+'@;:#~/?\|`][]d; #" s[\s+][ ]g; return $_; } my $text = normalise do{ local $/; <DATA> }; my( $n, %wordPosns ) = 0; $wordPosns{ pos( $text )-1 } = ++$n while $text =~ m[\s+]g; my $quote = normalise 'I want to get a little more sophisticated and a +llow the user to submit a mildly edited quote.'; my @qwords = split ' ', $quote; my %matches; for my $word ( @qwords ) { while( $text =~ m[\b$word\b]g ) { $matches{ $wordPosns{ pos $text } } = $word ; } } #pp \%matches; my @runs = []; reduce { push @{ $runs[ -1 ] }, $a; push @runs, [] if defined $a and $a+1 < $b; $b; } sort{ $a <=> $b } keys %matches; @runs = grep @$_ > 3, @runs; #pp \@runs; print join ' ', map $matches{ $_ }, @$_ for @runs; __DATA__

    produces:

    C:\test>665700 i want to get a little more sophisticated i want the user to to submit a mildly edited quote

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      As usual, you rule. I'll play around with that.

      Ah, how I love to see #" at the end of a line of someone else's code too. :)

Re: Verifying a quote matches (closely enough) a source URI
by ww (Archbishop) on Feb 02, 2008 at 13:51 UTC
    Excellent thoughts above and ++ to both and to OP. Sorry not to have any algorithm suggestions of the quality of the OP and the answers to date.

    That said, your desire to allow the user "to submit a mildly edited quote" reaches deep into the subjective... and thus into a province where human cognition still performs better than automated "judgment." BrowserUK's rude spoonerism and kind might be dealt with by adding a stop-list... but we've seen that battle fought and lost. And what might we do to deal with a case where the submitter modifies the original text (containing a word or phrase which is in itself unobjectionable) so that its meter, rhyme or other characteristic --- by virtue of the modification -- suggests an objectionable spoonerism?

    For me, though, even bigger concerns arise in the area of accuracy and context. As we've seen recently in (US presidential) politics and elsewhere, an accurately quoted but out of context fragment of an utterance can turn the intent of the original upon its head. So too can what automata might well consider mild editing: for example: a quote which adds a negative or something on the order of s/"We have nothing to fear but fear itself."/"We have to fear fear itself."/)

    Thus, though YMMV may vary with your intended application, /me believes that in many cases the tool you're building may -- given the state-of-the-art, today -- still require the application of human eyeballs and judgement in many or most cases.

    Update: added "and to OP." /me has no excuse for omitting that in the original.

Re: Verifying a quote matches (closely enough) a source URI
by SuicideJunkie (Vicar) on Feb 03, 2008 at 02:33 UTC

    How about parsing the quote into grammatical pieces? You could then compare the parse trees for similarity in addition to looking at what text has changed.

    If you can't parse the sentence at all, that's a good hint that it is spam. Otherwise, having a tree handy would make it easy to provide hints and highlights for the human editor to look at, and save their time.