Update: Added a little more substance to the demonstration.
String::Approx and Text::Levenshtein may not be as useful as you might expect for this, as they operate on the string as an array of chars.
It's easy to see how "cunning stunt" and its rude spoonerism would be rated as closely synonymous by such algorithms.
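To make that concrete, here is a minimal sketch using a tamer spoonerism pair ("crushing blow" / "blushing crow", my substitute example). The `levenshtein` and `shared_words` subs are hypothetical pure-Perl helpers written for this illustration; they are not the API of either module. The pair differs by at most four single-character substitutions out of thirteen characters, yet the two phrases have no whole word in common.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Character-level edit distance (classic dynamic programming).
# Hypothetical helper for illustration only.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = 0 .. length($t);
    for my $i ( 1 .. length($s) ) {
        my @cur = ( $i );
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # deletion
                $cur[ $j - 1 ] + 1,         # insertion
                $prev[ $j - 1 ] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Number of distinct whole words the two phrases have in common.
# Hypothetical helper for illustration only.
sub shared_words {
    my ( $s, $t ) = @_;
    my %in_s = map { ( $_ => 1 ) } split ' ', lc $s;
    my %seen;
    return scalar grep { $in_s{$_} && !$seen{$_}++ } split ' ', lc $t;
}

my ( $phrase, $spoonerism ) = ( 'crushing blow', 'blushing crow' );
printf "edit distance: %d of %d characters\n",
    levenshtein( $phrase, $spoonerism ), length($phrase);
printf "shared words:  %d\n", shared_words( $phrase, $spoonerism );
```

A char-based measure sees two nearly identical strings; a word-based comparison sees two unrelated phrases, which is closer to what you want when verifying a quote.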
Perhaps a better approach, once you've extracted the raw text from the HTML, would be to search the extracted text for the individual words from the quote and look for a high proportion of matches in close proximity.
If you store the matches in a hash, wordPositionInText => word, and then look for runs of consecutive, or nearly consecutive, positions in the result, you will find likely candidates easily. For example, matching the edited quote "I want to get a little more sophisticated and allow the user to submit a mildly edited quote." against the text from your post:
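    #! perl -slw
    use strict;
    use List::Util qw[ reduce ];
    use Data::Dump qw[pp];

    # Lower-case, strip punctuation and collapse runs of whitespace.
    sub normalise {
        local $_ = lc shift;
        tr[!"£$%^&*()-_=+'@;:#~/?\|`][]d; #"
        s[\s+][ ]g;
        return $_;
    }

    # Slurp the source text (placed after __DATA__) and normalise it.
    my $text = normalise do{ local $/; <DATA> };

    # Map the character offset at which each word ends to that word's ordinal.
    my( $n, %wordPosns ) = 0;
    $wordPosns{ pos( $text )-1 } = ++$n while $text =~ m[\s+]g;

    my $quote = normalise 'I want to get a little more sophisticated and allow the user to submit a mildly edited quote.';

    my @qwords = split ' ', $quote;

    # Record every occurrence of each quote word: wordPositionInText => word.
    my %matches;
    for my $word ( @qwords ) {
        while( $text =~ m[\b$word\b]g ) {
            $matches{ $wordPosns{ pos $text } } = $word ;
        }
    }
    #pp \%matches;

    # Split the sorted positions into runs of consecutive word positions.
    # Note: the final position is returned by reduce rather than pushed,
    # so it never joins a run.
    my @runs = [];
    reduce {
        push @{ $runs[ -1 ] }, $a;
        push @runs, [] if defined $a and $a+1 < $b;
        $b;
    } sort{ $a <=> $b } keys %matches;

    # Keep only runs of more than three words.
    @runs = grep @$_ > 3, @runs;
    #pp \@runs;

    print join ' ', map $matches{ $_ }, @$_ for @runs;
    __DATA__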
produces:
    C:\test>665700
    i want to get a little more sophisticated i want the user to to submit a mildly edited quote
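The "nearly consecutive" part can be made explicit with a gap tolerance. What follows is only a sketch under my own assumptions: `group_runs` and its `$max_gap` parameter are hypothetical names, not part of the script above. A plain loop is used instead of a pairwise reduce, so the final position also lands in a run.

```perl
use strict;
use warnings;

# Group word positions into runs. A new run starts whenever the gap to
# the previous position exceeds $max_gap (hypothetical parameter;
# 1 means strictly consecutive, 2 tolerates one missing word, etc.).
sub group_runs {
    my ( $max_gap, @positions ) = @_;
    my @runs;
    for my $p ( sort { $a <=> $b } @positions ) {
        if ( @runs and $p - $runs[-1][-1] <= $max_gap ) {
            push @{ $runs[-1] }, $p;    # close enough: extend current run
        }
        else {
            push @runs, [ $p ];         # too far away: start a new run
        }
    }
    return @runs;
}

# Made-up positions, standing in for: keys %matches
my @runs = group_runs( 2, 3, 4, 5, 7, 20, 21, 40 );
print join( ' ', @$_ ), "\n" for @runs;
```

With a tolerance of 2 the sample positions fall into three runs: 3 4 5 7, then 20 21, then 40. Filtering by run length, as the script above does, would then discard the short ones.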
In reply to Re: Verifying a quote matches (closely enough) a source URI
by BrowserUk
in thread Verifying a quote matches (closely enough) a source URI
by Your Mother