Regarding preserving the similarities, a sparse (2D) matrix can be used: either a CPAN module, e.g. Math::SparseMatrix and others, or simply emulate one with a 2-level hash.
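A minimal sketch of that 2-level-hash emulation (the `set_sim`/`get_sim` helper names are my own, not from any module): only entries that are actually set consume memory, and absent entries fall back to a default.

```perl
use strict;
use warnings;

my %S;  # 2-level hash emulating a sparse 2D matrix

sub set_sim { my ($a, $b, $v) = @_; $S{$a}{$b} = $v; }

sub get_sim {
    my ($a, $b) = @_;
    # short-circuit so probing an absent row does not autovivify $S{$a}
    return exists $S{$a} && exists $S{$a}{$b} ? $S{$a}{$b} : 0;
}

set_sim('to be', 'not to be', 0.8);
print get_sim('to be', 'not to be'), "\n";  # 0.8
print get_sim('to be', 'hamlet'), "\n";     # 0 (default for absent entries)
```

The `exists` check matters: merely reading `$S{$a}{$b}` would autovivify the `$S{$a}` row and the "sparse" matrix would slowly fill with empty rows.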
Similarities can be measured at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select just one of these when you can use them all in a multi-dimensional similarity index? Something like this (totally untested):
use List::Util qw(reduce);

# store similarities as a sparse matrix emulated by a 2-level hash
my $S = {};
# metric weights; all 1's means unweighted, usually the weights sum to 1
my $W = {'metric1' => 1, 'metric2' => 1, 'metric3' => 1};

# get the similarity values as a hash, keyed on metric names
my $sims = similarity($phrase1, $phrase2);
# get the phrase most similar to phrase1
my $most = most_similar($phrase1);
print "most similar to '$phrase1' is ".$most->{'phrase'}."\n";

# main entry to finding similarity between phrases A and B
sub similarity {
    my ($A, $B) = @_;
    if( ! exists($S->{$A}) || ! exists($S->{$A}->{$B}) ){
        # useless negation to satisfy certain monks' pet peeve
        $S->{$A}->{$B} = {
            'metric1' => metric1($A,$B),
            'metric2' => metric2($A,$B),
            'metric3' => metric3($A,$B),
        };
        # the weighted similarity is a rough 1D metric
        # combining all the other metrics
        my $weighted = 0;
        $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W;
        $S->{$A}->{$B}->{'weighted'} = $weighted;
    }
    return $S->{$A}->{$B};
}

# calculate similarity between phrases A and B using metric1
sub metric1 {
    my ($A,$B) = @_;
    return ...; # a real number, e.g. 3.5
}

sub most_similar {
    my ($A, $metric_name) = @_;
    if( ! defined($metric_name) || ! exists($W->{$metric_name}) ){
        $metric_name = 'weighted';
    }
    my $w = $S->{$A};
    my $max_sim_phrase = reduce {
        $w->{$b}->{$metric_name} > $w->{$a}->{$metric_name} ? $b : $a
    } keys %$w;
    my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name};
    return {
        'phrase' => $max_sim_phrase,
        'value'  => $max_sim_value,
    };
}
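For completeness, here is one way a real metric could look. The `jaccard_metric` sub below is my own illustration (not part of the sketch above): a "similar words" metric as the Jaccard overlap of the two phrases' word sets, returning a value in [0,1].

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# hypothetical stand-in for metric1(): Jaccard word overlap,
# i.e. |words(A) AND words(B)| / |words(A) OR words(B)|
sub jaccard_metric {
    my ($A, $B) = @_;
    my %a = map { lc($_) => 1 } split /\s+/, $A;
    my %b = map { lc($_) => 1 } split /\s+/, $B;
    my $inter = sum0 map { exists $b{$_} ? 1 : 0 } keys %a;
    my %union = (%a, %b);
    my $union = scalar keys %union;
    return $union ? $inter / $union : 0;
}

# shared words {to, not} out of union {to, be, or, not, worry}
printf "%.2f\n", jaccard_metric('to be or not to be', 'not to worry'); # 0.40
```

The other metrics (re-arranged phrase, sentiment, etc.) would each return their own score on whatever scale suits them; the weights in `$W` are then the place to bring the scales into balance.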
Edit: P.S. Stemming this ancient form of English can be a challenge, as stemmers rely on pre-trained models. Using the ancient Greek bible text could be even more challenging, since models for it are even harder to find.
bw, bliako