in reply to Re: Levenstein distance transcription
in thread Levenstein distance transcription
The Levenshtein algorithm finds the shortest edit script. In your example this would be:
    use strict;
    use warnings;
    use Algorithm::Diff qw(sdiff);
    use Data::Dumper;

    # Compare the two phrases word by word.
    # In scalar context sdiff returns an array reference.
    my $diff = sdiff(
        [ split /\W+/, "the quick brown fox" ],
        [ split /\W+/, "before the quick brown fox" ],
    );
    print Dumper($diff);

Output:

    $VAR1 = [
      [ '+', '', 'before' ],
      [ 'u', 'the', 'the' ],
      [ 'u', 'quick', 'quick' ],
      [ 'u', 'brown', 'brown' ],
      [ 'u', 'fox', 'fox' ]
    ];
Algorithm::Diff (aka A::D) uses the Hunt-Szymanski algorithm, which can be seen as an improved Levenshtein algorithm. The nice thing about A::D is that it takes arrays as input, so it can be used for anything that can be represented as an array of strings. The result can be the edit distance, the edit script, the length of the longest common subsequence (LLCS), the longest common subsequence itself (LCS), or a (global) alignment.
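To make those result types concrete, here is a minimal sketch using the same two phrases. It assumes Algorithm::Diff is installed from CPAN; the variable names are mine, and the "distance" shown counts insertions and deletions only, which is not necessarily the classic Levenshtein distance with substitutions.

```perl
use strict;
use warnings;
use Algorithm::Diff qw(LCS LCS_length sdiff);

my @old = split /\W+/, "the quick brown fox";
my @new = split /\W+/, "before the quick brown fox";

# LLCS: number of words in the longest common subsequence
my $llcs = LCS_length(\@old, \@new);

# LCS: the common words themselves (list context returns a list)
my @lcs = LCS(\@old, \@new);

# Insert/delete-only edit distance derived from the LLCS
my $indel_distance = @old + @new - 2 * $llcs;

# Edit script: every sdiff row that is not 'u' (unchanged)
my @script = grep { $_->[0] ne 'u' } sdiff(\@old, \@new);

print "LLCS: $llcs\n";                    # LLCS: 4
print "LCS: @lcs\n";                      # LCS: the quick brown fox
print "distance: $indel_distance\n";      # distance: 1
print "script: @{ $_ }\n" for @script;    # script: +  before
```

Because the inputs are plain arrays, the same calls work on characters (split //), lines, or any other tokens.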
A::D is reasonably fast and Algorithm::Diff::XS (A::D::XS) is a lot faster, but my private versions of them run at approximately double the speed. Of course, String::Similarity, which is purely string-based, determines only the LLCS, and implements Myers' algorithm in C, is about ten times faster.
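For comparison, a minimal sketch of the String::Similarity call, assuming the module is installed from CPAN. It works on whole strings (characters, not words) and returns a similarity index between 0 and 1 rather than an edit script. As far as I know the index is 2 * LLCS / (length1 + length2), but treat that formula as an assumption and check the module's documentation.

```perl
use strict;
use warnings;
use String::Similarity;   # exports similarity()

my $s1 = "the quick brown fox";
my $s2 = "before the quick brown fox";

# Similarity index: 0 = entirely different, 1 = identical
my $index = similarity($s1, $s2);
printf "similarity: %.3f\n", $index;

# Identical strings give exactly 1
my $same = similarity($s1, $s1);
print "identical: $same\n";
```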