Here's a fairly simple method to measure the similarity of strings. Given $one and $two, compress each one separately and also compress their concatenation. The ratio of the compressed size of the concatenation to the sum of the two separately compressed sizes measures their similarity: the smaller the ratio, the more alike the strings.
#!/usr/bin/perl
use Compress::Zlib 'compress';

# Usage: $arrayref = similarity( LIST)
# Returns: AoA reference to string similarity table for LIST
sub similarity {
    # Compressed length of each string on its own
    my %single = map {$_ => length compress $_} @_;
    my @ret;
    for my $this (@_) {
        # One row per string: size of the pair compressed together
        # over the sum of the two separately compressed sizes
        push @ret, [
            map {
                (length compress $this . $_)
                    / ($single{$this} + $single{$_})
            } @_
        ];
    }
    \@ret;
}

my @titles = (
    q(The Last Public Hanging In Old West Virginia - Flatt and Scruggs),
    q(Flatt_and_Scruggs__The_Last_Public_Hanging_In_Old_West_Virginia),
    q(Rainy Day Woman Number 12 and 35 - Flatt and Scruggs),
    q(Rainy Day Woman Number Twelve and Thirty-five - Bob Dylan),
);

my $results = similarity @titles;
for my $this (@$results) {
    print pack('A6' x @$this, map {sprintf '%4.3f', $_} @$this), $/;
}
__END__
0.529 0.715 0.784 0.841
0.708 0.529 0.887 0.870
0.784 0.863 0.536 0.748
0.848 0.863 0.739 0.532
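Note that 0.500 is the ideal minimum for this ratio: two identical strings should compress together to about the size of one of them alone, which is why the diagonal entries sit just above 0.5. Subtracting 0.5 from each entry would therefore give more impressive differences.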
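A minimal sketch of that adjustment, for anyone who wants to try it. The rescale helper is hypothetical and not part of the snippet above. Doubling the result so that it runs from roughly 0 to 1, and clamping at zero, are extra assumptions of mine on top of the plain subtraction.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of the original snippet.
# Takes an AoA reference of ratios and returns a new AoA in which
# each ratio r has become 2 * (r - 0.5), clamped so it never drops
# below zero. The doubling and the clamp are assumptions.
sub rescale {
    my ($table) = @_;
    return [
        map {
            [ map { my $d = 2 * ($_ - 0.5); $d < 0 ? 0 : $d } @$_ ]
        } @$table
    ];
}

# First two rows of the table printed above
my $scaled = rescale([
    [0.529, 0.715, 0.784, 0.841],
    [0.708, 0.529, 0.887, 0.870],
]);
for my $row (@$scaled) {
    print join(' ', map { sprintf '%5.3f', $_ } @$row), "\n";
}
```

With this scale, a string compared with itself lands near 0 and unrelated strings drift toward 1, so the ordering of the entries is unchanged and only the spread is wider.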
I saw this technique described in Scientific American recently. I will update if I can find out which issue.
After Compline, Zaxo