Update: Added a little more substance to the demonstration.

String::Approx and Text::Levenshtein may not be as useful as you might expect for this, as they operate on the string as an array of chars.

It's easy to see how "cunning stunt" and its rude spoonerism would be rated as closely synonymous by such algorithms.
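To put a number on that, here is a minimal sketch of a character-level edit distance. It is a textbook dynamic-programming Levenshtein written out by hand, not the implementation inside either module, and char_distance is just a name picked for the illustration. A tamer spoonerism pair stands in for the one above.

```perl
use strict;
use warnings;
use List::Util qw[ min ];

# Textbook dynamic-programming Levenshtein distance over characters.
# char_distance is a hypothetical helper name, not a module API.
sub char_distance {
    my( $s, $t ) = @_;
    my @prev = 0 .. length $t;          # distances from the empty prefix of $s
    for my $i ( 1 .. length $s ) {
        my @cur = ( $i );
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i-1, 1 ) eq substr( $t, $j-1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[ $j ]   + 1,      # deletion
                $cur[ $j-1 ]  + 1,      # insertion
                $prev[ $j-1 ] + $cost,  # substitution, or a free match
            );
        }
        @prev = @cur;
    }
    return $prev[ -1 ];
}

# Assumed example pair, chosen for the illustration only.
my( $phrase, $spoonerism ) = ( 'crushing blow', 'blushing crow' );
printf "%d edits across %d characters\n",
    char_distance( $phrase, $spoonerism ), length $phrase;
```

Only the swapped initial letters differ, so a character-based measure sees two phrases with quite different meanings as near-identical strings.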

Perhaps a better approach, once you've extracted the raw text from the HTML, would be to search the extracted text for the individual words from the quote and look for a high proportion of matches in close proximity.

If you store the matches in a hash (wordPositionInText => word) and then look for runs of consecutive, or nearly consecutive, positions in the result, you will find likely candidates easily. For example, matching the edited quote "I want to get a little more sophisticated and allow the user to submit a mildly edited quote." against the text from your post:

    #! perl -slw
    use strict;
    use List::Util qw[ reduce ];
    use Data::Dump qw[pp];

    ## Lower-case, strip punctuation, collapse whitespace runs to single spaces
    sub normalise {
        local $_ = lc shift;
        tr[!"£$%^&*()-_=+'@;:#~/?\|`][]d; #"
        s[\s+][ ]g;
        return $_;
    }

    my $text = normalise do{ local $/; <DATA> };

    ## Map the offset just past the end of each word to that word's ordinal position
    my( $n, %wordPosns ) = 0;
    $wordPosns{ pos( $text )-1 } = ++$n while $text =~ m[\s+]g;

    my $quote = normalise 'I want to get a little more sophisticated and allow the user to submit a mildly edited quote.';
    my @qwords = split ' ', $quote;

    ## Record every occurrence of each quote word: wordPositionInText => word
    my %matches;
    for my $word ( @qwords ) {
        while( $text =~ m[\b$word\b]g ) {
            $matches{ $wordPosns{ pos $text } } = $word ;
        }
    }
    #pp \%matches;

    ## Split the sorted positions into runs of consecutive positions.
    ## Note: reduce returns the final position rather than pushing it onto a run.
    my @runs = [];
    reduce {
        push @{ $runs[ -1 ] }, $a;
        push @runs, [] if defined $a and $a+1 < $b;
        $b;
    } sort{ $a <=> $b } keys %matches;

    ## Keep only runs of more than 3 words
    @runs = grep @$_ > 3, @runs;
    #pp \@runs;

    print join ' ', map $matches{ $_ }, @$_ for @runs;

    __DATA__

produces:

    C:\test>665700
    i want to get a little more sophisticated i want the user to to submit a mildly edited quote
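As far as I can tell, the script above only joins strictly consecutive positions into a run. If you want the "nearly consecutive" variant, one possible sketch is a plain loop with an explicit gap tolerance. The names group_runs and $maxGap are made up for this illustration, and the positions are invented rather than taken from a real run.

```perl
use strict;
use warnings;

# Group word positions into runs, starting a new run whenever the jump
# to the next position exceeds $maxGap (hypothetical parameter name).
sub group_runs {
    my( $maxGap, @posns ) = @_;
    my @runs;
    for my $p ( sort { $a <=> $b } @posns ) {
        if( @runs and $p - $runs[ -1 ][ -1 ] <= $maxGap ) {
            push @{ $runs[ -1 ] }, $p;      # close enough: extend the current run
        }
        else {
            push @runs, [ $p ];             # too far away: start a new run
        }
    }
    return @runs;
}

# Invented positions: 7 -> 9 skips one word, 9 -> 20 is a real break.
my @runs = group_runs( 2, 5, 6, 7, 9, 20, 21 );
print join( ' ', @$_ ), "\n" for @runs;
```

With a tolerance of 2 this prints "5 6 7 9" and "20 21", so a single inserted or dropped word in the quote no longer splits a candidate run. A tolerance of 1 gives the strictly consecutive behaviour.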


In reply to Re: Verifying a quote matches (closely enough) a source URI by BrowserUk
in thread Verifying a quote matches (closely enough) a source URI by Your Mother
