Re: similar texts !?

by Zaxo (Archbishop)
on Jul 12, 2003 at 15:47 UTC

in reply to similar texts !?

Here's a fairly simple method to measure the similarity of strings. Given $one and $two, apply text compression to each string separately and to their concatenation. The ratio of the compressed size of the concatenation to the sum of the two separately compressed sizes measures their similarity: the smaller the ratio, the closer the strings.

#!/usr/bin/perl
use Compress::Zlib 'compress';

# Usage: $arrayref = similarity( LIST)
# Returns: AoA reference to string similarity table for LIST
sub similarity {
    my (%single, @ret) = map {$_ => length compress $_} @_;
    for my $this (@_) {
        push @ret, [
            map {
                (length compress $this . $_) /
                ($single{$this} + $single{$_})
            } @_
        ];
    }
    \@ret;
}

my @titles = (
    q(The Last Public Hanging In Old West Virginia - Flatt and Scruggs),
    q(Flatt_and_Scruggs__The_Last_Public_Hanging_In_Old_West_Virginia),
    q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs),
    q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan),
);

my $results = similarity @titles;

for my $this (@$results) {
    print pack('A6' x @$this, map {sprintf '%4.3f', $_} @$this), $/;
}
__END__
0.529 0.715 0.784 0.841
0.708 0.529 0.887 0.870
0.784 0.863 0.536 0.748
0.848 0.863 0.739 0.532

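Note that 0.500 is the ideal minimum for this ratio (two identical strings compressed together should take about as much space as one alone), so subtracting 0.5 from the values in the table would give more impressive differences.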
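
As a minimal sketch of how the table might be used, the following picks the closest other title for each row and applies the suggested subtraction. The helpers rescale and closest_matches are hypothetical names, not part of the original post, and doubling the result so that it runs from 0 to 1 is an extra assumption beyond the plain subtraction of 0.5.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: subtract the ideal minimum of 0.5, then double
# (the doubling is an assumption) so that 0.5 maps to 0 and 1.0 to 1.
sub rescale {
    my ($ratio) = @_;
    return 2 * ($ratio - 0.5);
}

# Hypothetical helper: for each row of the table, return the column index
# of the smallest ratio, ignoring the diagonal (a string against itself).
sub closest_matches {
    my ($table) = @_;
    my @closest;
    for my $i (0 .. $#$table) {
        my $best;
        for my $j (0 .. $#{ $table->[$i] }) {
            next if $j == $i;    # skip the self-comparison
            $best = $j
                if !defined $best
                || $table->[$i][$j] < $table->[$i][$best];
        }
        push @closest, $best;
    }
    return @closest;
}

# Ratios copied from the output of the script in this post.
my @table = (
    [0.529, 0.715, 0.784, 0.841],
    [0.708, 0.529, 0.887, 0.870],
    [0.784, 0.863, 0.536, 0.748],
    [0.848, 0.863, 0.739, 0.532],
);

my @closest = closest_matches(\@table);
for my $i (0 .. $#table) {
    my $j = $closest[$i];
    printf "title %d is closest to title %d (rescaled %.3f)\n",
        $i, $j, rescale($table[$i][$j]);
}
```

With these numbers the two "Last Public Hanging" titles select each other, and so do the two "Rainy Day Woman" titles, which is the grouping one would hope for.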

I saw this technique described in a SciAm recently. Will update if I can find out which.

After Compline,

Replies are listed 'Best First'.
Re: Re: similar texts !?
by allolex (Curate) on Jul 13, 2003 at 08:47 UTC

    Great idea, but a small caveat anyway---the accuracy of this method increases greatly on texts that are a bit longer than MP3 filenames. :)

    (The keyword for any googling on the subject is "maximum entropy". You can have a look here as well.)
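
One likely contributor to the short-text caveat above (an assumption, not something stated in the reply) is the fixed overhead that zlib adds to every compressed string, which matters more the shorter the input is. The sketch below only prints the sizes so the effect can be inspected; self_ratio is a hypothetical helper and the sample string is illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib 'compress';

# Hypothetical helper: the ratio from the parent post for a string
# compared with itself, which would ideally come out at 0.5.
sub self_ratio {
    my ($text) = @_;
    return length(compress($text . $text)) / (2 * length(compress($text)));
}

# Illustrative sample, shortened from one of the titles in the parent post.
my $short = 'Rainy Day Woman';

# Even an empty string produces a non-empty compressed result.
printf "empty string compresses to %d bytes\n", length(compress(''));

# A very short string can come out larger than it went in.
printf "%d-byte string compresses to %d bytes\n",
    length($short), length(compress($short));

# The self-comparison therefore lands above the ideal 0.5.
printf "self-similarity ratio: %.3f\n", self_ratio($short);
```

Running the same helper on progressively longer samples is a simple way to see how quickly the self-comparison ratio approaches 0.5 for a given kind of text.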


Re: Re: similar texts !?
by bugsbunny (Scribe) on Jul 12, 2003 at 17:18 UTC
    cool waiting ... for update
