in reply to Verifying a quote matches (closely enough) a source URI

Update: Added a little more substance to the demonstration.

String::Approx and Text::Levenshtein may not be as useful as you might expect for this, as they operate on the string as an array of chars.

It's easy to see how "cunning stunt" and its rude spoonerism would be rated as closely synonymous by such algorithms.
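To put a rough number on that, here is a minimal sketch with a hand-rolled character-level Levenshtein distance, so it doesn't depend on either module's API. The `levenshtein` sub and the tamer spoonerism pair are stand-ins of mine, not anything from the modules above.

```perl
use strict;
use warnings;
use List::Util qw[ min ];

# Plain dynamic-programming Levenshtein distance over characters
# (an illustrative helper, not the Text::Levenshtein implementation).
sub levenshtein {
    my( $s, $t ) = @_;
    my @prev = 0 .. length $t;              # distances from the empty prefix of $s
    for my $i ( 1 .. length $s ) {
        my @cur = ( $i );
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i-1, 1 ) eq substr( $t, $j-1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[ $j ] + 1,            # deletion
                $cur[ $j-1 ] + 1,           # insertion
                $prev[ $j-1 ] + $cost,      # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[ -1 ];
}

# A tamer spoonerism pair, used here as a stand-in.
my( $x, $y ) = ( 'crushing blow', 'blushing crow' );

# Count the whole words the two phrases have in common.
my %in_x   = map { $_ => 1 } split ' ', $x;
my $shared = grep { $in_x{ $_ } } split ' ', $y;

printf "char edits: %d of %d chars; shared words: %d\n",
    levenshtein( $x, $y ), length $x, $shared;
```

Only a handful of character edits separate the two phrases, yet they share no whole words, which is why comparing word by word seems the safer bet here.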

Perhaps a better approach, once you've extracted the raw text from the HTML, would be to search the extracted text for the individual words from the quote, and look for a high proportion of matches in close proximity.

If you store the matches in a hash (wordPositionInText => word) and then look for runs of consecutive, or nearly consecutive, positions in the result, you will find likely candidates easily. For example, match the edited quote "I want to get a little more sophisticated and allow the user to submit a mildly edited quote." against the text from your post:

#! perl -slw
use strict;
use List::Util qw[ reduce ];
use Data::Dump qw[pp];

## Lower-case, delete the listed characters, collapse whitespace runs to one space.
sub normalise {
    local $_ = lc shift;
    tr[!"£$%^&*()-_=+'@;:#~/?\|`][]d; #"
    s[\s+][ ]g;
    return $_;
}

my $text = normalise do{ local $/; <DATA> };

## Map the offset of the whitespace ending each word to that word's number.
my( $n, %wordPosns ) = 0;
$wordPosns{ pos( $text )-1 } = ++$n while $text =~ m[\s+]g;

my $quote = normalise 'I want to get a little more sophisticated and allow the user to submit a mildly edited quote.';
my @qwords = split ' ', $quote;

## Record every occurrence of each quote word, keyed by its word number in the text.
my %matches;
for my $word ( @qwords ) {
    while( $text =~ m[\b$word\b]g ) {
        $matches{ $wordPosns{ pos $text } } = $word ;
    }
}
#pp \%matches;

## Split the sorted word numbers into runs of consecutive positions.
my @runs = [];
reduce {
    push @{ $runs[ -1 ] }, $a;
    push @runs, [] if defined $a and $a+1 < $b;
    $b;
} sort{ $a <=> $b } keys %matches;

## Keep only runs of more than three words.
@runs = grep @$_ > 3, @runs;
#pp \@runs;

print join ' ', map $matches{ $_ }, @$_ for @runs;

__DATA__

produces:

C:\test>665700
i want to get a little more sophisticated i want the user to to submit a mildly edited quote
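The demonstration only joins strictly consecutive positions. To allow the "nearly consecutive" case as well, the run-building step could take a gap tolerance. A rough sketch follows; `group_runs` and its `$max_gap` parameter are hypothetical names of mine, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: group integer word positions into runs, starting a
# new run whenever the gap to the previous position exceeds $max_gap.
# A $max_gap of 1 means strictly consecutive positions only.
sub group_runs {
    my( $max_gap, @posns ) = @_;
    my @runs;
    for my $p ( sort { $a <=> $b } @posns ) {
        if( @runs and $p - $runs[ -1 ][ -1 ] <= $max_gap ) {
            push @{ $runs[ -1 ] }, $p;      # close enough: extend the current run
        }
        else {
            push @runs, [ $p ];             # too far away: start a new run
        }
    }
    return @runs;
}

# Made-up positions, purely for illustration.
my @runs = group_runs( 2, 3, 4, 5, 7, 20, 21, 30 );
print join( ' ', @$_ ), "\n" for @runs;
# prints:
#   3 4 5 7
#   20 21
#   30
```

With a tolerance of 2, one inserted or dropped word no longer splits a run; how big a gap to allow is a judgment call that depends on how heavily you expect quotes to be edited.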

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^2: Verifying a quote matches (closely enough) a source URI
by Your Mother (Archbishop) on Feb 02, 2008 at 20:41 UTC

    As usual, you rule. I'll play around with that.

    Ah, how I love to see #" at the end of a line of someone else's code too. :)