Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hello all,

I need to compare two strings and find how many characters should be deleted and how many should be inserted. I know that this is edit distance, but how can I get those numbers?

For instance:

This is an old question. This is a new question.

The result should be 4 deletions and 3 insertions.

Thanks

Replies are listed 'Best First'.
Re: Edit distance between two strings
by AppleFritter (Vicar) on Jul 08, 2014 at 09:44 UTC

    The standard metric for that is called the Levenshtein distance. A quick CPAN search suggests Text::Levenshtein or Text::Fuzzy. The latter also has the ability to show you what edits were made in which positions.

    Note that the Levenshtein distance allows for substitutions in addition to insertions and deletions. I'm not aware of a "standard" edit distance metric that only uses insertions and deletions, but you could fiddle with the weights in Text::WagnerFischer to rule out substitutions.

    #!/usr/bin/perl use feature qw/say/; use warnings; use strict; use Text::Levenshtein; use Text::Fuzzy; use Text::WagnerFischer; my $old = "This is an old question."; my $new = "This is a new question."; say Text::Levenshtein::distance($old, $new); my $tf = Text::Fuzzy->new($old); say $tf->distance($new); my ($distance, $edits) = Text::Fuzzy::distance_edits($old, $new); say $edits; say Text::WagnerFischer::distance([0, 1, 999], $old, $new);

    This prints:

    $ perl test.pl 4 4 kkkkkkkkkdkrrrkkkkkkkkkk 7 $

    Take your pick.

Re: Edit distance between two strings
by Laurent_R (Canon) on Jul 08, 2014 at 18:00 UTC
    AppleFritter has probably provided you with what you need, with quite a number of details. Just a couple of additional clues.

    One thing you might do is to google for the Baeza-Yates k-mismatches algorithm, the Wu-Manber k-differences algorithm and the longest common subsequence (LCS) algorithms.

    Some additional modules of possible interest: Jarkko Hietaniemi's String::Approx and Mark-Jason Dominus' Algorithm::Diff.