Regarding preserving the similarities: a sparse (2D) matrix can be used, either from a CPAN module, e.g. Math::SparseMatrix, or simply emulated with a two-level hash.
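
As a minimal sketch of the two-level hash approach: only the pairs that were actually compared are stored, and lookups are written so that they do not autovivify rows by accident. The helper names (set_sim, get_sim), the sample phrases and the values are made up for illustration.

```perl
use strict;
use warnings;

# sparse similarity matrix emulated with a two-level hash:
# only pairs that were actually compared take up memory
my $S = {};

# store a similarity value for the pair (row, column)
sub set_sim {
  my ($row, $col, $value) = @_;
  $S->{$row}->{$col} = $value;
}

# fetch a value without autovivifying a missing row;
# returns undef for pairs that were never stored
sub get_sim {
  my ($row, $col) = @_;
  return undef unless exists $S->{$row};
  return $S->{$row}->{$col};
}

# illustrative data only
set_sim('in the beginning', 'at the beginning', 0.8);
set_sim('in the beginning', 'in the end', 0.3);

my $known   = get_sim('in the beginning', 'at the beginning');
my $unknown = get_sim('let there be light', 'in the end');

print "known pair: $known\n";
print "unknown pair: ", (defined $unknown ? $unknown : 'not stored'), "\n";
print "rows stored: ", scalar(keys %$S), "\n";
```

The exists check in get_sim matters: a plain $S->{$row}->{$col} read on an unknown row would silently create an empty row and slowly fill the "sparse" matrix with junk.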

Similarities can be measured at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select just one of these when you can use them all in a multi-dimensional similarity index? Something like this (totally untested):

use List::Util qw(reduce);

# store similarities as a sparse matrix as a 2-level hash
my $S = {};
# metric weights, all 1's means not weighted, usually sum-of-weights=1
my $W = {'metric-1' => 1, 'metric-2' => 1, 'metric-3' => 1};
# get a list of similarity values as a hash, keyed on metric names
my $sims = similarity($phrase1, $phrase2);

# get the most similar to phrase1
my $most = most_similar($phrase1);
print "most similar to '$phrase1' is ".$most->{'phrase'}."\n";

# main entry to finding similarity between phrases A and B
sub similarity {
  my ($A, $B) = @_;
  if( ! exists($S->{$A}) || ! exists($S->{$A}->{$B}) ){
    # useless negation to satisfy certain monks' pet peeve
    $S->{$A}->{$B} = {
      'metric-1' => metric1($A,$B),
      'metric-2' => metric2($A,$B),
      'metric-3' => metric3($A,$B),
    };
    # this is a weighted similarity, it's a rough 1D metric based
    # on all other metrics.
    my $weighted = 0;
    $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W;
    $S->{$A}->{$B}->{'weighted'} = $weighted;
  }
  return $S->{$A}->{$B}
}

# calculate similarity between phrases A and B using metric1
# (metric2 and metric3, not shown, follow the same pattern)
sub metric1 {
  my ($A,$B) = @_;
  ...  # placeholder: should return a real number, e.g. 3.5
}

sub most_similar {
  my ($A, $metric_name) = @_;
  if( ! defined($metric_name) || ! exists($W->{$metric_name}) ){
    $metric_name = 'weighted'
  }
  my $w = $S->{$A};
  my $max_sim_phrase = List::Util::reduce {
    $w->{$b}->{$metric_name} > $w->{$a}->{$metric_name} ? $b : $a
  } keys %$w;
  my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name};
  return {
    'phrase' => $max_sim_phrase,
    'value' => $max_sim_value
  }
}
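
To try the idea end to end, the placeholder metrics need something concrete behind them. Below is a self-contained sketch under loud assumptions: the two metrics (exact_match and word_overlap, a Jaccard index on word sets), the weights and the sample phrases are stand-ins of mine, not a recommendation for the actual task.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# hypothetical stand-in metrics, each returning a value in [0, 1]
my %METRICS = (
  'exact'   => \&exact_match,
  'overlap' => \&word_overlap,
);
# assumed weights, chosen so that they sum to 1
my %W = ('exact' => 0.25, 'overlap' => 0.75);
# sparse cache of computed similarities: $S{A}{B}{metric}
my %S;

# 1 if the phrases are identical ignoring case, else 0
sub exact_match {
  my ($A, $B) = @_;
  return lc($A) eq lc($B) ? 1 : 0;
}

# Jaccard index on the sets of lower-cased words
sub word_overlap {
  my ($A, $B) = @_;
  my %in_a = map { $_ => 1 } split ' ', lc $A;
  my %in_b = map { $_ => 1 } split ' ', lc $B;
  my $common = grep { $in_b{$_} } keys %in_a;
  my $union  = keys(%in_a) + keys(%in_b) - $common;
  return $union ? $common / $union : 0;
}

# compute (once) and cache all metrics plus their weighted sum
sub similarity {
  my ($A, $B) = @_;
  unless( exists $S{$A} && exists $S{$A}{$B} ){
    my %sims = map { $_ => $METRICS{$_}->($A, $B) } keys %METRICS;
    $sims{'weighted'} = 0;
    $sims{'weighted'} += $W{$_} * $sims{$_} for keys %W;
    $S{$A}{$B} = \%sims;
  }
  return $S{$A}{$B};
}

# the cached phrase most similar to A under the weighted metric
sub most_similar {
  my ($A) = @_;
  my $w = $S{$A} or return undef;
  return reduce { $w->{$b}{'weighted'} > $w->{$a}{'weighted'} ? $b : $a } keys %$w;
}

my $query = 'let there be light';
similarity($query, $_)
  for ('and there was light', 'in the beginning', 'there be light');

my $best = most_similar($query);
printf "most similar to '%s' is '%s' (weighted %.4f)\n",
  $query, $best, $S{$query}{$best}{'weighted'};
```

With these made-up inputs the winner is 'there be light', because it shares three of the four distinct words with the query. Swapping in better metrics only means adding entries to %METRICS and %W.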

Edit: P.S. Stemming this ancient form of English can be a challenge, as stemmers rely on language-specific rules or pre-trained models. Using the ancient Greek Bible text could be even more challenging, since suitable models may be hard to find.
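
A toy illustration of the stemming problem, assuming a purely rule-based suffix stripper. This is not the Porter algorithm nor any CPAN stemmer, and the suffix lists are hypothetical; it only shows that rules written for modern English leave archaic verb endings untouched.

```perl
use strict;
use warnings;

# toy suffix stripper: NOT Porter, NOT a CPAN module, illustration only
sub toy_stem {
  my ($word, @suffixes) = @_;
  $word = lc $word;
  for my $suffix (@suffixes) {
    # strip the first matching suffix, keeping a stem of 3+ characters
    return $1 if $word =~ /^(\w{3,}?)\Q$suffix\E$/;
  }
  return $word;   # no rule applied
}

my @modern  = qw(ing ed s);               # rules for modern endings only
my @archaic = (qw(eth est), @modern);     # hypothetical extra rules

for my $word (qw(loved loveth lovest)) {
  printf "%-7s modern rules: %-7s with archaic rules: %s\n",
    $word, toy_stem($word, @modern), toy_stem($word, @archaic);
}
```

With the modern rules only, "loved" is reduced but "loveth" and "lovest" come back unchanged, so the three forms never meet. Adding the two archaic suffixes makes all three collapse to the same stem.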

bw, bliako

In reply to Re: String Comparison & Equivalence Challenge by bliako
in thread String Comparison & Equivalence Challenge by Polyglot
