in reply to String Comparison & Equivalence Challenge

Regarding preserving the similarities, a sparse (2D) matrix can be used, either from a CPAN module (e.g. Math::SparseMatrix, among others) or simply emulated with a 2D hash (a hash of hashes).
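
As a minimal sketch of the 2D-hash emulation, assuming phrases themselves are used as keys: only the pairs that were actually compared take up memory. The name get_sim and the similarity values are made up for illustration.

```perl
use strict;
use warnings;

# Sparse matrix emulated with a hash of hashes:
# a cell exists only for pairs that were actually stored.
my %sim;

# Store similarity values for two pairs of phrases (values are made up).
$sim{'in the beginning'}{'at the start'} = 0.8;
$sim{'in the beginning'}{'the end'}      = 0.1;

# Look up a cell without autovivifying the row:
# test the row first, then the column.
sub get_sim {
    my ($m, $row, $col) = @_;
    return undef unless exists $m->{$row} && exists $m->{$row}{$col};
    return $m->{$row}{$col};
}

my $hit  = get_sim(\%sim, 'in the beginning', 'at the start');
my $miss = get_sim(\%sim, 'no such', 'phrase');

print "stored pair: $hit\n";
print "unknown pair: ", (defined $miss ? $miss : 'not stored'), "\n";
```

Checking the row before the column matters because a bare `exists $m->{$row}{$col}` would silently create an empty row for every phrase that is looked up.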

Similarities can be measured at different levels with different metrics: exact phrase, re-arranged phrase, similar words, similar sentiment. But why select one of these when you can use them all in a multi-dimensional similarity index? Something like this (totally untested):

use List::Util qw(reduce);

# store similarities as a sparse matrix, emulated with a 2-level hash
my $S = {};

# metric weights, all 1's means not weighted, usually sum-of-weights=1
my $W = {'metric1' => 1, 'metric2' => 1, 'metric3' => 1};

# get a list of similarity values as a hash, keyed on metric names
# ($phrase1 and $phrase2 are assumed to be defined elsewhere)
my $sims = similarity($phrase1, $phrase2);

# get the most similar to phrase1
my $most = most_similar($phrase1);
print "most similar to '$phrase1' is ".$most->{'phrase'}."\n";

# main entry to finding similarity between phrases A and B
sub similarity {
    my ($A, $B) = @_;
    # compute each pair only once; '||' so that a known A
    # paired with a new B is still computed
    if( ! exists($S->{$A}) || ! exists($S->{$A}->{$B}) ){ # useless negation to satisfy certain monks' pet peeve
        # the metric names must match the keys of $W
        $S->{$A}->{$B} = {
            'metric1' => metric1($A,$B),
            'metric2' => metric2($A,$B),
            'metric3' => metric3($A,$B),
        };
        # this is a weighted similarity, it's a rough 1D metric based
        # on all other metrics.
        my $weighted = 0;
        $weighted += $W->{$_} * $S->{$A}->{$B}->{$_} for keys %$W;
        $S->{$A}->{$B}->{'weighted'} = $weighted;
    }
    return $S->{$A}->{$B}
}

# calculate similarity between phrases A and B using metric1
# (metric2 and metric3 follow the same pattern)
sub metric1 {
    my ($A,$B) = @_;
    ... # placeholder: return a real, e.g. 3.5
}

# find the phrase most similar to A among those already compared with it
sub most_similar {
    my ($A, $metric_name) = @_;
    # fall back to the weighted metric if none (or an unknown one) is given
    if( ! defined($metric_name) || ! exists($W->{$metric_name}) ){ $metric_name = 'weighted' }
    my $w = $S->{$A};
    my $max_sim_phrase = List::Util::reduce {
        $w->{$b}->{$metric_name} > $w->{$a}->{$metric_name} ? $b : $a
    } keys %$w;
    my $max_sim_value = $w->{$max_sim_phrase}->{$metric_name};
    return {
        'phrase' => $max_sim_phrase,
        'value' => $max_sim_value
    }
}
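
The metric subs above are left as stubs. As a rough illustration only, here are two hypothetical metrics that could fill them in: an exact-phrase test and a Jaccard word overlap (shared words divided by all distinct words), which ignores word order and so also covers re-arranged phrases. The names words_of, metric_exact, metric_jaccard, weighted_similarity and the 0.5/0.5 weights are my own assumptions, not part of the sketch above.

```perl
use strict;
use warnings;

# Split a phrase into lowercase words, dropping empty fields
# (hypothetical helper).
sub words_of {
    my ($phrase) = @_;
    return grep { length } split /\W+/, lc $phrase;
}

# Exact-phrase metric: 1 if both phrases contain the same words in the
# same order (ignoring case and punctuation), else 0.
sub metric_exact {
    my ($A, $B) = @_;
    return join(' ', words_of($A)) eq join(' ', words_of($B)) ? 1 : 0;
}

# Jaccard word overlap: |intersection| / |union| of the two word sets.
# Word order does not matter, so re-arranged phrases score 1.
sub metric_jaccard {
    my ($A, $B) = @_;
    my %in_a = map { $_ => 1 } words_of($A);
    my %in_b = map { $_ => 1 } words_of($B);
    my $common = grep { $in_b{$_} } keys %in_a;
    my $union  = keys(%in_a) + keys(%in_b) - $common;
    return $union ? $common / $union : 0;
}

# Assumed weights, chosen so that they sum to 1.
my %W = (exact => 0.5, jaccard => 0.5);

# Collapse all metrics into one rough 1D value.
sub weighted_similarity {
    my ($A, $B) = @_;
    my %s = (
        exact   => metric_exact($A, $B),
        jaccard => metric_jaccard($A, $B),
    );
    my $total = 0;
    $total += $W{$_} * $s{$_} for keys %W;
    return $total;
}

print metric_jaccard('the quick brown fox', 'brown fox the quick'), "\n"; # prints 1
print metric_exact('the quick brown fox', 'brown fox the quick'), "\n";   # prints 0
print weighted_similarity('in the beginning', 'in the end'), "\n";        # prints 0.25
```

Similar-words or sentiment metrics would need stemming or a trained model and are not attempted here.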

Edit: P.S. Stemming this ancient form of English can be a challenge, as stemming relies on pre-trained models. Using the ancient greek bible text could be even more challenging finding models.

bw, bliako

Re^2: String Comparison & Equivalence Challenge
by LanX (Saint) on Mar 14, 2021 at 17:52 UTC
    > Using the ancient greek bible text could be even more challenging finding models.

    Only the New Testament is originally in Greek, the old one is in Hebrew and Aramaic AFAIK.

    Please correct me.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I had no idea, but it's reasonable (edit: what you say).