Regarding preserving the similarities, a sparse (2D) matrix can be used: either a CPAN module, e.g. Math::SparseMatrix and others, or simply emulate one with a 2-level hash.
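A minimal sketch of that 2-level-hash emulation (the `set_sim`/`get_sim` helper names are my own, not from any module): only entries that are actually set consume memory, and absent entries fall back to a default.

```perl
use strict;
use warnings;

my %S;  # 2-level hash emulating a sparse 2D matrix

sub set_sim { my ($a, $b, $v) = @_; $S{$a}{$b} = $v; }

sub get_sim {
    my ($a, $b) = @_;
    # short-circuit so probing an absent row does not autovivify $S{$a}
    return exists $S{$a} && exists $S{$a}{$b} ? $S{$a}{$b} : 0;
}

set_sim('to be', 'not to be', 0.8);
print get_sim('to be', 'not to be'), "\n";  # 0.8
print get_sim('to be', 'hamlet'), "\n";     # 0 (default for absent entries)
```

The `exists` check matters: merely reading `$S{$a}{$b}` would autovivify the `$S{$a}` row and the "sparse" matrix would slowly fill with empty rows.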
Similarities can be measured at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select just one of these when you can use them all in a multi-dimensional similarity index? Something like this (totally untested):
use List::Util qw(reduce);

# store similarities as a sparse matrix emulated by a 2-level hash
my $S = {};
# metric weights; all 1's means unweighted, usually the weights sum to 1
my $W = {'metric1' => 1, 'metric2' => 1, 'metric3' => 1};

# get the similarity values as a hash, keyed on metric names
my $sims = similarity($phrase1, $phrase2);
# get the phrase most similar to phrase1
my $most = most_similar($phrase1);
print "most similar to '$phrase1' is ".$most->{'phrase'}."\n";

# main entry to finding similarity between phrases A and B
sub similarity {
    my ($A, $B) = @_;
    if( ! exists($S->{$A}) || ! exists($S->{$A}->{$B}) ){
        # useless negation to satisfy certain monks' pet peeve
        $S->{$A}->{$B} = {
            'metric1' => metric1($A,$B),
            'metric2' => metric2($A,$B),
            'metric3' => metric3($A,$B),
        };
        # the weighted similarity is a rough 1D metric
        # combining all the other metrics
        my $weighted = 0;
        $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W;
        $S->{$A}->{$B}->{'weighted'} = $weighted;
    }
    return $S->{$A}->{$B};
}

# calculate similarity between phrases A and B using metric1
sub metric1 {
    my ($A,$B) = @_;
    return ...; # a real number, e.g. 3.5
}

sub most_similar {
    my ($A, $metric_name) = @_;
    if( ! defined($metric_name) || ! exists($W->{$metric_name}) ){
        $metric_name = 'weighted';
    }
    my $w = $S->{$A};
    my $max_sim_phrase = reduce {
        $w->{$b}->{$metric_name} > $w->{$a}->{$metric_name} ? $b : $a
    } keys %$w;
    my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name};
    return {
        'phrase' => $max_sim_phrase,
        'value'  => $max_sim_value,
    };
}
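For completeness, here is one way a real metric could look. The `jaccard_metric` sub below is my own illustration (not part of the sketch above): a "similar words" metric as the Jaccard overlap of the two phrases' word sets, returning a value in [0,1].

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# hypothetical stand-in for metric1(): Jaccard word overlap,
# i.e. |words(A) AND words(B)| / |words(A) OR words(B)|
sub jaccard_metric {
    my ($A, $B) = @_;
    my %a = map { lc($_) => 1 } split /\s+/, $A;
    my %b = map { lc($_) => 1 } split /\s+/, $B;
    my $inter = sum0 map { exists $b{$_} ? 1 : 0 } keys %a;
    my %union = (%a, %b);
    my $union = scalar keys %union;
    return $union ? $inter / $union : 0;
}

# shared words {to, not} out of union {to, be, or, not, worry}
printf "%.2f\n", jaccard_metric('to be or not to be', 'not to worry'); # 0.40
```

The other metrics (re-arranged phrase, sentiment, etc.) would each return their own score on whatever scale suits them; the weights in `$W` are then the place to bring the scales into balance.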
Edit: P.S. Stemming this ancient form of English can be a challenge, as stemmers rely on pre-trained models. Using the ancient Greek bible text could be even more challenging, since models for it are even harder to find.
bw, bliako