Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Smart Comparison of Buffy Strings

by Cody Pendant (Prior)
on Feb 03, 2002 at 00:23 UTC ( [id://142994]=perlquestion: print w/replies, xml ) Need Help??

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

This is a more or less theoretical question, but I hope that fellow monks might find it interesting.

I'm working on a database of quotes (from Buffy The Vampire Slayer).

I want people to be able to add a quote to the database, but I also want to check the quote isn't already there.

So say I have this quote in the database:

I walk. I talk. I shop, I sneeze. I'm gonna be a fireman when the floods roll back. There's trees in the desert since you moved out. And I don't sleep on a bed of bones.

what if someone comes along and enters an abbreviated version -- say they think it's better just as:

I'm gonna be a fireman when the floods roll back. There's trees in the desert since you moved out.

or what if their spelling/interpretation differs slightly:

I'm going to be a fireman when the floods roll back. ^^^^^^^^^^^

I would like to be able to somehow compare the strings in such a way that the fact that the quote is already entered shows up, despite the fact that it's not exactly the same, either shorter or longer or slightly differently written in some aspects.

I can't think of a smart way to do this, though of course you could just compare all possible substrings of the two strings, which seems overly mechanical.

The result should presumably be a percentage score or similar?

Thanks in Advance,
CP

--
Weaselling out of things is important. It's what separates us from the animals ... except the weasel.

Replies are listed 'Best First'.
Re: Smart Comparison of Buffy Strings
by dws (Chancellor) on Feb 03, 2002 at 00:57 UTC
    I would like to be able to somehow compare the strings in such a way that the fact that the quote is already entered shows up, despite the fact that it's not exactly the same, either shorter or longer or slightly differently written in some aspects.

    I've seen a simple approach to a similar problem. It went something like:

    1. Convert each string to an array of words.
    2. Delete "noise" words ("a", "the", "to", etc.) from each array.
    3. Convert each remaining word into a "canonical" form (e.g., all lower-case, apostrophe's deleted.)
    4. Compute a score based on the Hamming Distance between the source array is to the target array. (Hamming distance is a measure of how close two strings are to each other. In one form*, it considers transpositions and deletions.)
    The two tricky parts are settling on a cannonical form, which can get quite involved if you want to consider stemming, and the Hamming distance calculation.

    Try the simple approach first: convert all words to lower-case, drop noise words, then see how close the two arrays are in terms of common elements in a common order.

    ----
    *I did a quick google search. Most descriptions talk in terms of Hamming's original definition, which was in terms of bit flips. I've seen this applied to strings of arbitrary data somewhere -- perhaps in an Algorithms text. The string comparison that limewire uses to guess at whether a search string matches an MP3 title might also be worth checking, though it's in Java.

Re: Smart Comparison of Buffy Strings
by blakem (Monsignor) on Feb 03, 2002 at 01:37 UTC
    I think String::Approx might be a good place to start... The Acme::Buffy module is probably a bad place to start. ;-)

    -Blake

(cLive ;-) Re: Smart Comparison of Buffy Strings
by cLive ;-) (Prior) on Feb 03, 2002 at 03:15 UTC
    You might want to convert the strings to soundex and then compare them:
    $longest = soundex of longer string $shortest = soundex of shorter string if ($longest =~ /$shortest/) { ... }
    as a quick check. Then do the same for un-soundex versions. if not exact, perhaps try exact word s/// - if no word chars left, assume it's the same quote (or subset of).

    cLive ;-)

Re: Smart Comparison of Buffy Strings
by drinkd (Pilgrim) on Feb 03, 2002 at 13:50 UTC
    I like the soundex approach, but it seemed to me that a XOR of the strings is the more elegant way, then look for a series of 1's longer than some threshold.

    A Super Search came back with This very similar thread, except searching a whole database for such sorta-matches. Sure enough, the consensus winner was This XOR solution from japhy, although there were many other good sugestions.

    drinkd

      In the node you cite, the goal was to ensure that two strings are "differnet enough". In this node, the goal is to determine if two nodes are "similar enough"

      That's an important distinction.

      japhy's solution considers two strings different if their lengths are significantly different. I'm fairly certain that's not what Cody wants. If one quote is a complete subset, or a complete superset of another, with only an few subtle spelling mistakes, it should still be considered "similar enough". I would definitely go with String::Approx and set the 'I' and 'D' modifiers very high to allow a large number of Inserts and Deletes (and of course: the 'i' modifier for case insensativeity). You'll have to really play with the "approximateness percentage" to ensure that it allows enough subtle word spelling differences, but doesn't complain that every quote that uses the word "the" is the same.

      And of course: I wouldn't recomend this to flat out reject any quote, just to flag it as a potential duplicate.

Re: Smart Comparison of Buffy Strings
by dws (Chancellor) on Feb 05, 2002 at 07:57 UTC
    Here's the start of another approach, based on using Algorithm::Diff to count the number of common words (in sequence) between two arrays of strings. By computing a percentage based on the smaller of the arrays, you address both the case of the submitted quote being a subset of an existing quote, and vice versa.

    Here's how it hands the "gonna" -> "going to" change.

    use strict; use Algorithm::Diff qw(traverse_sequences); my @quote1 = split(' ', <<QUOTE); I walk. I talk. I shop, I sneeze. I'm gonna be a fireman when the floods roll back. There's trees in the desert since you moved out. And I don't sleep on a bed of bones. QUOTE my @quote2 = split(' ', <<QUOTE); I'm going to be a fireman when the floods roll back. There's trees in the desert since you moved out. QUOTE my $match = 0; traverse_sequences(\@quote1, \@quote2, { MATCH => sub { $match++ } } ); my $smaller = @quote1 < @quote2 ? @quote1 : @quote2; my $percent_match = int(($match / $smaller) * 100); print "$match words matched ($percent_match%)\n"; __END__ 18 words matched (90%)
    To make this work in practice, you would need to do things like ensure that the shorter quote was at least n words long, so that a 1 word candidate quote didn't match at 100%.

    A few minutes perusing the Algorithm::Diff POD might give you some additional ideas.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://142994]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2024-04-18 12:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found