in reply to Comparing inputs before inserting into MySQL for "sounds-like" similarity
This is a fascinating problem and there is a very effective solution. I implemented a version of it in C (code since lost) some years ago, based on a research paper I can't find anymore! Sorry.
I will describe the problem and the technique, and leave you to find the paper. I have done a very quick Perl implementation to demonstrate the basics of how it works.
The problem with finding similar, but not exact, document duplicates in a large document collection is that any naive implementation is O(n^2): you have to compare each new document with every old document, so as the number of documents climbs this quickly becomes impractical.
Mission impossible? Not so. Here is the algorithm.

When a document enters the collection we fingerprint it: tokenise it into every overlapping run of N consecutive words (a walking frame; the demo below uses N=4), hash each frame, sort the hashes, and keep the first 20 as that document's stored fingerprint. Because the hashes scatter the frames evenly, this amounts to keeping 20 randomly selected tokens. Now for any new document we want to test against our set, we tokenise and get our top 20 randomly selected tokens the same way, then count matches against the stored fingerprints. For dissimilar documents there will be, say, 0-2 matches. The probability that the new document is a near-identical duplicate increases geometrically with each match found. The odd hash collision is immaterial to the logic.
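Since the original question is about storing these in MySQL, here is a rough sketch of how the fingerprint store and lookup could work with DBI. The schema, the names (fingerprint, doc_id, token) and the match threshold are my own inventions for illustration, not from the paper:

use DBI;

# Hypothetical schema (names invented for this sketch):
#   CREATE TABLE fingerprint (
#       doc_id INT NOT NULL,
#       token  CHAR(32) NOT NULL,    -- hash of one word frame
#       INDEX (token)
#   );
my $dbh = DBI->connect( 'dbi:mysql:docs', 'user', 'pass',
    { RaiseError => 1 } );

# Store the 20 fingerprint tokens of a newly inserted document.
sub store_fingerprint {
    my ( $doc_id, $top_twenty ) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO fingerprint (doc_id, token) VALUES (?, ?)' );
    $sth->execute( $doc_id, $_ ) for @$top_twenty;
}

# Count matching tokens per stored document. Each incoming token is
# one index lookup, so the cost does not grow with every comparison
# the way an all-pairs scheme would.
sub find_similar {
    my $top_twenty = shift;
    my $in = join ',', ('?') x @$top_twenty;
    return $dbh->selectall_arrayref(
        "SELECT doc_id, COUNT(*) AS matches
           FROM fingerprint
          WHERE token IN ($in)
          GROUP BY doc_id
         HAVING matches > 2
          ORDER BY matches DESC",
        undef, @$top_twenty );
}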
Case and punctuation are removed from the equation on entry. Changing paragraph order only affects the few tokens that overlap the beginning and end of the paragraph. Changing sentence order is similar but has a greater effect, because the number of identical tokens we generate depends on how many full frames fit inside that sentence. Deleting, inserting, or changing a single word only affects the frames that include it, i.e. if our frame width is 4 it will affect only 4 tokens.
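To make that concrete, here is a tiny self-contained check (my own snippet, separate from the demo that follows): change one word in a nine-word sentence and count how many frames differ.

#!/usr/bin/perl -w
use strict;

# With a frame width of 4, changing a single word should alter at
# most 4 of the walking-frame tokens.
my $FRAME = 4;

sub frames {
    my @w = split ' ', lc shift;
    return map { join ' ', @w[ $_ .. $_ + $FRAME - 1 ] } 0 .. @w - $FRAME;
}

my %a = map { $_ => 1 } frames("the quick brown fox jumps over the lazy dog");
my %b = map { $_ => 1 } frames("the quick brown cat jumps over the lazy dog");

my $changed = grep { !$b{$_} } keys %a;
print "$changed frames differ\n";    # prints: 4 frames differ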
Here is some demo code. Note that there is no strict need to hash; we would only do it to pack a given token into a fixed, unique(ish) space and to make sure that when we sort and pick, the selection is effectively random. Note how the plain sort used here pulls the numerical items up first, so it is not as random as we want, but it is good enough for a demo.
#!/usr/bin/perl -w
use strict;

my $text1 = <<TEXT;
Once upon a time, in a land far away, there lived a giant. .. .
He was big. Really big. He had 5.5 toes on 2.3 feet.
TEXT

my $text2 = <<TEXT;
Once upon a time, in a land far away, there lived a GIANT.
He was big!!!!!!!!!!!!! YOU GOTTA BELIEVE ME. Really big!
He was HUGE I tell you. He had 5.5 toes on 2.3 feet.
TEXT

my $FRAME = 4;

print "Here are the texts:\n\n$text1\n\n$text2\n\n";

my $res1 = tokenise($text1);
my $res2 = tokenise($text2);

print "Here is the walking token frame for text1:\n\n";
print "$_\n" for @$res1;
print "\n\nAnd here are the matches:\n\n";

my $top_twenty1 = get_top($res1);
my $top_twenty2 = get_top($res2);

# A token that appears in both top-20 lists counts as a match.
my $same;
$same->{$_}++ for @$top_twenty1, @$top_twenty2;
my $matches = grep { $same->{$_} == 2 } keys %$same;

printf "%20s %1s | %1s %-20s\n",
    $top_twenty1->[$_],
    $same->{ $top_twenty1->[$_] } == 2 ? "*" : " ",
    $same->{ $top_twenty2->[$_] } == 2 ? "*" : " ",
    $top_twenty2->[$_]
  for 0 .. 19;

print "\nFound $matches matches!\n";

# Lowercase, turn everything but [a-z0-9] into spaces, then walk a
# $FRAME-word window through the text one word at a time.
sub tokenise {
    my $text = shift;
    $text = lc $text;
    $text =~ tr/a-z0-9/ /c;
    my @text = split " ", $text;
    my @tokens;
    push @tokens, join " ", @text[ $_ .. $_ + $FRAME - 1 ]
      for 0 .. @text - $FRAME;
    return \@tokens;
}

# Sort the tokens and keep the first 20 as the fingerprint.
sub get_top {
    my $ary = shift;
    [ (sort @$ary)[ 0 .. 19 ] ];
}

__DATA__
Here are the texts:

Once upon a time, in a land far away, there lived a giant. .. .
He was big. Really big. He had 5.5 toes on 2.3 feet.

Once upon a time, in a land far away, there lived a giant. .. .
He was big!!!!!!!!!!!!! You gotta believe me. Really big!
He was huge I tell you. He had 5.5 toes on 2.3 feet.

Here is the walking token frame for text1:

once upon a time
upon a time in
a time in a
time in a land
in a land far
a land far away
land far away there
far away there lived
away there lived a
there lived a giant
lived a giant he
a giant he was
giant he was big
he was big really
was big really big
big really big he
really big he had
big he had 5
he had 5 5
had 5 5 toes
5 5 toes on
5 toes on 2
toes on 2 3
on 2 3 feet

And here are the matches:

         5 5 toes on * | * 5 5 toes on
         5 toes on 2 * | * 5 toes on 2
      a giant he was * | * a giant he was
     a land far away * | * a land far away
         a time in a * | * a time in a
  away there lived a * | * away there lived a
        big he had 5   |   believe me really big
   big really big he   |   big he was huge
far away there lived * |   big you gotta believe
    giant he was big * | * far away there lived
        had 5 5 toes * | * giant he was big
          he had 5 5 * |   gotta believe me really
   he was big really   | * had 5 5 toes
       in a land far * | * he had 5 5
 land far away there * |   he was big you
    lived a giant he   |   he was huge i
         on 2 3 feet   |   huge i tell you
    once upon a time   |   i tell you he
   really big he had   | * in a land far
 there lived a giant   | * land far away there

Found 12 matches!
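To fix the bias just mentioned (the plain sort favours tokens that start with digits), you can hash each frame before sorting and picking. Here is a minimal sketch, assuming the core Digest::MD5 module is an acceptable stand-in for whatever hash the original implementation used:

use Digest::MD5 qw(md5_hex);

# Drop-in replacement for get_top(): hashing first packs each frame
# into a fixed 32-character space and makes the 20 smallest values an
# effectively uniform random sample of the document's frames.
sub get_top_hashed {
    my $ary = shift;
    [ (sort map { md5_hex($_) } @$ary)[ 0 .. 19 ] ];
}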
So that is the basic concept. Now you can tune this in various ways. A wide frame makes the algorithm more sensitive to sentence order changes and to insertions/deletions. The frame should be wide enough to miss common phrases, but narrower than the average sentence. The algorithm works best at detecting similar documents of similar lengths. If you have a 20k document and someone lifts 1k out of it, we would expect only a single match, because we will probably have sampled 19 of our 20 tokens from the 19k that was not lifted. If we increase the number of tokens in line with document length, we can still detect this, as more than 2 matches will rarely happen between completely dissimilar docs. Our storage requirement then becomes a function of the total size of the documents in the collection rather than of their number, but it is still linear.
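One possible way to scale the sample size with document length; the ratio of one extra sample per 1000 frames is my assumption, not a figure from the original technique:

# Variant of get_top() whose sample grows with the number of frames:
# the 20-token baseline plus one extra sample per 1000 frames.
sub get_top_scaled {
    my $ary  = shift;
    my $keep = 20 + int( @$ary / 1000 );
    $keep = @$ary if $keep > @$ary;    # cap at the number of frames
    [ (sort @$ary)[ 0 .. $keep - 1 ] ];
}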
Anyway HTH
Cheers
tachyon