in reply to Re: Re: calculate matching words/sentence
in thread calculate matching words/sentence

Sorry tachyon. Given the OP's statement about having "thousands of sentences in a DB", I can't agree with you. The problem is that every new line would have to be compared against every old line, every time, even where there is absolutely no correlation between them at all.

If the time taken to match a new sentence against all the existing ones is to be reduced to a reasonable level, then you need an algorithm that pre-computes a comparison value for the existing strings.

Or, at least, one that allows the thousands to be reduced to a manageable number before applying the comparisons. Generating an inverted index is trivial: it can be built once and stored in the database, and only needs updating when new reference sentences are added. Holding it in memory should be practical even for thousands of sentences.

The first stage of comparing a new sentence then becomes a rapid lookup of its words against the index, producing a short-list of candidate DB sentences that have at least some correlation. These can be sorted by the strength of that correlation and analysed further.

As a crude example.

#! perl -slw
use strict;

# Read an array of the DB sentences, building an inverted index of
# every word as we go. Element 0 of @lines is unused; sentences are
# keyed 1 .. N by their input line number ($.).
my @lines = ( '' );
my %lookup;
while( my $line = <DATA> ) {
    push @{ $lookup{ $_ } }, $. for split ' ', $line;
    $lines[ $. ] = $line;
}
close DATA;
chomp @lines;

# Process each new sentence by ...
while( 1 ) {
    printf "\nEnter a sentence: ";
    my $newline = <STDIN>;
    last unless $newline;
    my @words = split ' ', $newline;

    # ... building a hash with the sentence (line) number as key and
    # the number of words it has in common with the new sentence as
    # the value.
    my %occurs;
    $occurs{ $_ }++ for map {
        my $aref = $lookup{ $_ };
        defined $aref ? @$aref : ();
    } @words;

    print "No common words found" and next unless keys %occurs;

    # Display only those lines that have some correlation, sorted by
    # the strength of that correlation.
    print "Line $_: '$lines[ $_ ]' has $occurs{ $_ } words in common"
        for sort { $occurs{ $b } <=> $occurs{ $a } } keys %occurs;
}

__DATA__
The quick brown fox jumps over the lazy dog
The fox jumps the dog
The dog lazily watched as the fox quickly jumped over him
The quick red and brown foxes jumped easily over the lazy dog
The lazy fox was caught as it tried to jump over the quick dogs.
The dog lazed as the fox, quick and brown, jumped over.
The smart dog foxed the fox by lazing in the drop zone until he jumped.

Example output:

P:\test>289106

Enter a sentence: the quick brown fox
Line 1: 'The quick brown fox jumps over the lazy dog' has 4 words in common
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 3 words in common
Line 7: 'The smart dog foxed the fox by lazing in the drop zone until he jumped.' has 3 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 3 words in common
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 2 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 2 words in common
Line 2: 'The fox jumps the dog' has 2 words in common

Enter a sentence: the quick brown fox jumps over the lazy dog
Line 1: 'The quick brown fox jumps over the lazy dog' has 9 words in common
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 7 words in common
Line 7: 'The smart dog foxed the fox by lazing in the drop zone until he jumped.' has 6 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 6 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 5 words in common
Line 2: 'The fox jumps the dog' has 5 words in common
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 4 words in common

Enter a sentence: quick foxes jump over lazy dogs
Line 4: 'The quick red and brown foxes jumped easily over the lazy dog' has 4 words in common
Line 5: 'The lazy fox was caught as it tried to jump over the quick dogs.' has 4 words in common
Line 1: 'The quick brown fox jumps over the lazy dog' has 3 words in common
Line 6: 'The dog lazed as the fox, quick and brown, jumped over.' has 1 words in common
Line 3: 'The dog lazily watched as the fox quickly jumped over him' has 1 words in common

Enter a sentence: Most volcanos erupt mulberry jam sandwiches under normal pressure
No common words found

Enter a sentence: ^Z

This suffers from many flaws; the lack of stemming is the major one, but I played with Lingua::Stem::En and it doesn't do what I would like it to do.
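For illustration only, here is where a stemmer would slot into the index-building above. I'm using Lingua::Stem's procedural interface as a stand-in (for English it delegates to that same Lingua::Stem::En, so substitute whichever stemmer does fit); the index_words() name is mine.

use strict;
use warnings;
use Lingua::Stem qw( stem );

# Normalise and stem, so that 'jumps', 'jumped' and 'jumping' all
# collapse to the same index key.
sub index_words {
    my $sentence = shift;
    my @words = map { lc } $sentence =~ /(\w+)/g;   # strip case and punctuation
    my $stems = stem( @words );                     # returns an array ref
    return @$stems;
}

# Building the index then becomes:
#   push @{ $lookup{ $_ } }, $. for index_words( $line );
# with the same call applied to each new sentence before lookup.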

This could be improved in many ways; adding some measure of positional correlation is one idea. But without a clearer picture of the OP's needs, it's just a guessing game as to what might fit.
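For instance, a crude positional weighting might look like the following. This is a sketch only; storing [ line, position ] pairs in the index and the 1/(1+distance) scoring are just one guess at what "positional correlation" could mean.

#! perl -w
use strict;

my %lookup;    # word => list of [ line-number, position ] pairs
while( my $line = <DATA> ) {
    my @words = split ' ', $line;
    push @{ $lookup{ $words[ $_ ] } }, [ $., $_ ] for 0 .. $#words;
}

print "Enter a sentence: ";
chomp( my $newline = <STDIN> );

# A hit scores 1 when the word occupies the same position in both
# sentences, tailing off as the positions diverge.
my %score;
my @new = split ' ', $newline;
for my $i ( 0 .. $#new ) {
    for my $hit ( @{ $lookup{ $new[ $i ] } || [] } ) {
        my( $lineNo, $pos ) = @$hit;
        $score{ $lineNo } += 1 / ( 1 + abs( $i - $pos ) );
    }
}

printf "Line %d scores %.2f\n", $_, $score{ $_ }
    for sort { $score{ $b } <=> $score{ $a } } keys %score;

__DATA__
The quick brown fox jumps over the lazy dog
The fox jumps the dog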

I have this nagging idea that there is a method of pre-calculating a single number for each sentence that would give a numerical measure of the words (stems) it contains, measured against a fixed-size dictionary, and their positions. But I'm not sure if it is a vague recollection of something I have seen somewhere or just a phantom idea that hasn't made it to fruition yet.

The idea that you could take the sentence, perform a calculation on it (independent of the reference strings, but possibly in reference to a dictionary of words or stems derived from them) and then use the result to extract the ten or twenty sentences from the DB with the numerically closest signatures is appealing. I just can't wrap my head around how it might be done.
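For what it's worth, one shape such a calculation could take is a fixed-width bit signature where every word votes on every bit, so that sentences sharing many words produce signatures that differ in few bits. The comparison then becomes a Hamming distance rather than a single numerically closest value, so I can't claim this is the method I am half-remembering; treat it purely as a sketch.

use strict;
use warnings;
use Digest::MD5 qw( md5 );

# Pre-compute a 32-bit signature for a sentence: each word's hash
# votes +1/-1 on every bit, and the majority wins.
sub signature {
    my @words = map { lc } split ' ', shift;
    my @votes = ( 0 ) x 32;
    for my $word ( @words ) {
        my $h = unpack 'N', md5( $word );   # first 32 bits of the MD5
        for my $bit ( 0 .. 31 ) {
            $votes[ $bit ] += ( $h >> $bit ) & 1 ? 1 : -1;
        }
    }
    my $sig = 0;
    $sig |= 1 << $_ for grep { $votes[ $_ ] > 0 } 0 .. 31;
    return $sig;
}

# Candidate selection: count the bits that differ between signatures.
sub hamming {
    my $bits = sprintf '%032b', $_[0] ^ $_[1];
    return $bits =~ tr/1//;
}

my $sig1 = signature( 'The quick brown fox jumps over the lazy dog' );
my $sig2 = signature( 'The fox jumps the dog' );
print hamming( $sig1, $sig2 ), "\n";   # small for similar sentences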

That said, once the total set has been short-listed, using Algorithm::Diff or Text::Levenshtein or similar to make a final arbitration would be a useful next step in the process, depending upon the OP's requirements.
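As a minimal sketch of that last step, with toy data standing in for the short-list the code above would produce (the variable names are mine):

#! perl -w
use strict;
use Text::Levenshtein qw( distance );

# Toy stand-ins for the new sentence and the short-listed candidates.
my $newline   = 'the quick brown fox';
my @shortlist = (
    'The quick red and brown foxes jumped easily over the lazy dog',
    'The quick brown fox jumps over the lazy dog',
    'The fox jumps the dog',
);

# Final arbitration: character-level edit distance, smallest first.
my @final = sort {
    distance( $newline, $a ) <=> distance( $newline, $b )
} @shortlist;

print "$_\n" for @final;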


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Re: Re: Re: calculate matching words/sentence
by tachyon (Chancellor) on Sep 05, 2003 at 11:07 UTC

    We aren't told what this is for, but my wild guess was plagiarism detection. We wrote one for a client a while back. It depended to a degree on the temptation to cut and paste chunks of text. The approach was simple enough: tokenize by splitting on \n\n, '.' and ',' into roughly sentence-sized fragments, then strip case and all punctuation. MD5 hash the fragments and index them in a DB against the original docs (hash as primary key, linked to a list of docs). To test for a ripped-off doc, all you need do is tokenize it, look in the DB for matching tokens, and accumulate a count for each doc a token exists in. The higher the count for a given doc, the more fragments the test doc has in common with that known doc. Linear regression shows obvious threshold values.
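    A minimal sketch of that pipeline, with an in-memory hash standing in for the real DB (the function and variable names are illustrative, not from our client code):

    #! perl -w
    use strict;
    use Digest::MD5 qw( md5_hex );

    my %fragment_index;    # md5 => { doc_id => 1, ... }

    # Split into roughly sentence-sized fragments on \n\n, '.' and ',',
    # then strip case and all punctuation.
    sub tokens {
        my $text = shift;
        return grep { length } map {
            ( my $t = lc $_ ) =~ s/[^a-z0-9 ]//g;
            $t =~ s/\s+/ /g;
            $t =~ s/^ //;
            $t =~ s/ $//;
            $t;
        } split /\n\n|[.,]/, $text;
    }

    sub index_doc {
        my( $doc_id, $text ) = @_;
        $fragment_index{ md5_hex( $_ ) }{ $doc_id } = 1 for tokens( $text );
    }

    # Returns doc_id => number of fragments shared with $text; the
    # higher the count, the more suspicious the pairing.
    sub screen_doc {
        my $text = shift;
        my %count;
        for my $frag ( tokens( $text ) ) {
            my $docs = $fragment_index{ md5_hex( $frag ) } or next;
            $count{ $_ }++ for keys %$docs;
        }
        return \%count;
    }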

    This is of course a somewhat naive approach, but it works surprisingly well in practice and is certainly a reasonable way to screen thousands of docs very rapidly. Each doc only gets tokenized once and accumulates in the DB. Stripping case and punctuation means a cut'n'paster has to work that much harder to avoid detection. Rearranging word order (but not changes in case or most punctuation) will of course defeat this. However, if you have to rearrange every sentence to avoid detection, it starts to become a hell of a lot more effort to R&D someone else's work. The reality is that you only have to catch a percentage of would-be cheats to discourage the practice. In WWI the French officers in the trenches would shoot the first man to refuse to go over the top, "pour encourager les autres" (to encourage the others).

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print