Re^2: Levenstein distance transcription

The Levenshtein algorithm finds the shortest edit script. In your example this would be:

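use strict;
use warnings;
use Algorithm::Diff qw(sdiff);
use Data::Dumper;

# Compare the two sentences word by word.
# In scalar context sdiff returns a reference to the list of edit entries.
my $diff = sdiff(
  [split(/\W+/, "the quick brown fox")],
  [split(/\W+/, "before the quick brown fox")]
);
print Dumper($diff);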

$VAR1 = [
          [
            '+',
            '',
            'before'
          ],
          [
            'u',
            'the',
            'the'
          ],
          [
            'u',
            'quick',
            'quick'
          ],
          [
            'u',
            'brown',
            'brown'
          ],
          [
            'u',
            'fox',
            'fox'
          ]
        ];

Algorithm::Diff (aka A::D) uses the Hunt-Szymanski algorithm, an LCS algorithm that can be seen as an improved Levenshtein algorithm. The nice thing about A::D is that it takes arrays as input, so it can be used for anything that can be represented as an array of strings. The result can be an edit distance, an edit script, the length of the longest common subsequence (LLCS), the longest common subsequence (LCS) itself, or a (global) alignment.
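
To make the relation between LLCS and edit distance concrete, here is a minimal pure-Perl sketch. It uses the textbook O(m*n) dynamic program, not the algorithm A::D implements, and the helper names llcs and indel_distance are made up for this illustration. It assumes an edit distance that counts only insertions and deletions, in which case distance = m + n - 2*LLCS.

```perl
use strict;
use warnings;

# Length of the longest common subsequence (LLCS) of two array refs,
# computed with the textbook dynamic program, one row at a time.
# Illustration only: this is NOT how Algorithm::Diff works internally.
sub llcs {
    my ($x, $y) = @_;
    my @prev = (0) x (@$y + 1);      # DP row for the previous element of $x
    for my $i (1 .. @$x) {
        my @curr = (0);              # column 0: empty prefix of $y
        for my $j (1 .. @$y) {
            $curr[$j] = $x->[$i - 1] eq $y->[$j - 1]
                ? $prev[$j - 1] + 1
                : ($prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1]);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Edit distance when only insertions and deletions are allowed
# (an assumption of this sketch; a substitution would count as 2).
sub indel_distance {
    my ($x, $y) = @_;
    return @$x + @$y - 2 * llcs($x, $y);
}

my @old = split /\W+/, "the quick brown fox";
my @new = split /\W+/, "before the quick brown fox";
printf "LLCS = %d, distance = %d\n",
    llcs(\@old, \@new), indel_distance(\@old, \@new);
# prints "LLCS = 4, distance = 1"
```

The distance of 1 matches the single '+' entry in the sdiff output above: four words are unchanged and one word is inserted.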

A::D is reasonably fast and A::DXS (Algorithm::Diff::XS) is a lot faster, but my private versions of them run at approximately double the speed. Of course, String::Similarity, which is purely string-based, determines only the LLCS, and implements Myers' algorithm in C, is ten times faster.
