If the input isn't "pure Unicode IPA", you'll probably end up with a look-up table (i.e. a hash) for transliterating the various digraph character sequences (e.g. 'dz' as two characters) into the corresponding single Unicode code point. You do a regex substitution for each of those first, before doing your edit-distance computation -- e.g.
    my %translate = ( 'dz' => "\x{02a3}", 'ts' => "\x{02a6}", ... );

    $_ = "string with ts and dz digraphs";

    for my $digraph ( keys %translate ) {
        s/$digraph/$translate{$digraph}/g;
    }
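One side note on the Unicode handling (an assumption about where your strings come from, not part of the snippet above): if the IPA text is read from a UTF-8 file rather than written as \x{...} escapes, you'll want an explicit encoding layer so the substitutions operate on characters instead of raw bytes -- something like:

    use utf8;   # only needed if this source file itself contains UTF-8 literals

    # 'ipa_words.txt' is a hypothetical input file name
    open my $fh, '<:encoding(UTF-8)', 'ipa_words.txt' or die "open: $!";
    binmode STDOUT, ':encoding(UTF-8)';   # so printed IPA characters aren't mangled

    while ( my $line = <$fh> ) {
        chomp $line;
        # ... run the digraph substitutions on $line here ...
        print "$line\n";
    }
    close $fh;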
As for the notion of using distinctive features instead, yeah, it's an attractive idea, but very tricky. Each original letter (phonemic segment) needs to become a "word", such that every "word" is the same length, comprising a fixed sequence of feature symbols. Creating the lookup table of letters/phonemes to feature symbol "word" strings will be half the work, and then trying to make sense of the edit-distance results on those strings will be the other half. (It's a lot of work.)
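Just to make the idea concrete, a minimal sketch of that kind of table might look like the following. The feature symbols, the feature order, and the handful of segments shown are all made up for illustration; a real table would need a row for every segment and an agreed-upon, fixed feature inventory:

    use utf8;

    # Hypothetical mapping from single IPA segments to fixed-length "feature words".
    # Each "word" is a fixed sequence of one-character feature symbols, here
    # (voicing, place, manner) -- the symbols are invented for this example.
    my %features = (
        "\x{02a3}" => 'VAA',   # dz : voiced, alveolar, affricate
        "\x{02a6}" => 'UAA',   # ts : voiceless, alveolar, affricate
        's'        => 'UAF',   #      voiceless, alveolar, fricative
        'z'        => 'VAF',   #      voiced, alveolar, fricative
    );

    # Turn a string of segments into one long string of feature symbols,
    # so an ordinary edit-distance routine can be run over it.
    sub to_feature_string {
        my ($segments) = @_;
        return join '', map { $features{$_} // ( '?' x 3 ) } split //, $segments;
    }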
Actually, I think just using the original phonetic/phonemic segment letters will suffice, since the diffs between related language varieties will tend to cluster around particular pairings of related phonemes, and from those pairings, the relationships of distinctive feature patterns will tend to be fairly obvious.
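If you do go the plain-segment route, the edit-distance step itself is straightforward. Here's a rough pure-Perl Levenshtein sketch (a CPAN module such as Text::Levenshtein would do just as well), operating on the transliterated, one-code-point-per-segment strings:

    use utf8;

    # Plain dynamic-programming Levenshtein distance over segment characters.
    # Assumes both strings have already been run through the digraph
    # transliteration above, so each character is one phonemic segment.
    sub edit_distance {
        my ( $s, $t ) = @_;
        my @s = split //, $s;
        my @t = split //, $t;

        my @prev = ( 0 .. scalar @t );   # row 0 of the DP matrix
        for my $i ( 1 .. scalar @s ) {
            my @curr = ($i);
            for my $j ( 1 .. scalar @t ) {
                my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
                my $min  = $prev[$j] + 1;                                        # deletion
                $min = $curr[ $j - 1 ] + 1     if $curr[ $j - 1 ] + 1 < $min;    # insertion
                $min = $prev[ $j - 1 ] + $cost if $prev[ $j - 1 ] + $cost < $min; # substitution
                $curr[$j] = $min;
            }
            @prev = @curr;
        }
        return $prev[-1];
    }

    # e.g. "tsa" vs "dza" after transliteration -> distance 1
    print edit_distance( "\x{02a6}a", "\x{02a3}a" ), "\n";

Once you have those distances, tallying which segment pairs actually get substituted for each other across your word list is where the "related phonemes cluster together" pattern should show up.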