So, is it the case that the input Excel spreadsheet does not use Unicode characters? (It seems like it should, if it's trying to convey IPA symbols.) If there's Unicode in the input spreadsheet, I expect it'll be encoded as UTF-16BE (go figure) -- here's an example for handling xls input with Unicode content: xls2tsv.
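If the cell values do come back as raw UTF-16BE bytes, the core Encode module can turn them into Perl character strings. Here's a minimal sketch, assuming the bytes are already in hand; $raw is a made-up stand-in for whatever your spreadsheet parser returns, not anything taken from xls2tsv:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for a cell value delivered as UTF-16BE bytes:
# U+0064 'd', U+0292 (ezh), U+0061 'a'
my $raw = "\x00\x64\x02\x92\x00\x61";

# Turn the bytes into a Perl character string (one element per code point).
my $text = decode( 'UTF-16BE', $raw );

# length() now counts characters rather than bytes.
my $char_count = length($text);

# Encode as UTF-8 before printing, so the output is a valid byte stream.
my $utf8 = encode( 'UTF-8', $text );
print "$utf8 ($char_count characters)\n";
```

Once the data is decoded like this, regex matches and edit-distance counts work per character instead of per byte.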

If the input isn't "pure Unicode IPA", you'll probably end up with a look-up table (i.e. a hash) for transliterating various digraph sequences (e.g. 'dz' typed as two characters) into the corresponding single Unicode code point. You do regex substitutions with each of those first, before doing your edit-distance computation -- e.g.

    # digraph (two plain letters) => single IPA code point
    my %translate = (
        'dz' => "\x{02a3}",
        'ts' => "\x{02a6}",
        # ... more digraphs
    );
    $_ = "string with ts and dz digraphs";
    for my $digraph ( keys %translate ) {
        s/$digraph/$translate{$digraph}/g;
    }
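One thing to watch as the table grows: if one key is a prefix of another (a trigraph starting with 'ts', say), the order of substitution matters. Here's a sketch of a one-pass variant that tries the longest keys first. The sub name to_single_symbols is just something I made up for illustration:

```perl
use strict;
use warnings;

my %translate = (
    'dz' => "\x{02a3}",    # LATIN SMALL LETTER DZ DIGRAPH
    'ts' => "\x{02a6}",    # LATIN SMALL LETTER TS DIGRAPH
);

# Longest keys first, so a longer sequence wins over any shorter prefix;
# quotemeta guards against keys that contain regex metacharacters.
my $pattern = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %translate;

# Replace every listed sequence in a single pass over the string.
sub to_single_symbols {
    my ($str) = @_;
    $str =~ s/($pattern)/$translate{$1}/g;
    return $str;
}

my $out = to_single_symbols('string with ts and dz digraphs');
```

With only the two keys shown, this gives the same result as the loop; the ordering only starts to matter once keys overlap.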

As for the notion of using distinctive features instead, yeah, it's an attractive idea, but very tricky. Each original letter (phonemic segment) needs to become a "word", such that every "word" is the same length, comprising a fixed sequence of feature symbols. Creating the lookup table of letters/phonemes to feature symbol "word" strings will be half the work, and then trying to make sense of the edit-distance results on those strings will be the other half. (It's a lot of work.)
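To make the "fixed-length word" idea concrete, here's a toy sketch. The three-feature table is invented for illustration (a real one would need many more features and segments), and the sub names are hypothetical:

```perl
use strict;
use warnings;

# Toy table, invented for illustration: each segment maps to a "word" of
# three symbols in a fixed order: voice, nasal, continuant.
my %features = (
    p => '---',
    b => '+--',
    m => '++-',
    s => '--+',
    z => '+-+',
);

# Expand a string of segments into its feature "words", one per segment.
sub to_feature_words {
    my ($segments) = @_;
    my @words;
    for my $seg ( split //, $segments ) {
        die "no features for segment '$seg'" unless exists $features{$seg};
        push @words, $features{$seg};
    }
    return @words;
}

# Count the feature positions at which two equal-length "words" differ.
sub feature_diff {
    my ( $w1, $w2 ) = @_;
    die "feature words differ in length" unless length($w1) == length($w2);
    return scalar grep { substr( $w1, $_, 1 ) ne substr( $w2, $_, 1 ) }
        0 .. length($w1) - 1;
}

my @words = to_feature_words('pmz');
my $p_vs_b = feature_diff( $features{p}, $features{b} );    # voicing only
my $p_vs_z = feature_diff( $features{p}, $features{z} );    # voice + continuant
```

Even in this tiny version you can see where the work goes: building the table, and then deciding what a per-feature difference count should mean.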

Actually, I think just using the original phonetic/phonemic segment letters will suffice, since the diffs between related language varieties will tend to cluster around particular pairings of related phonemes, and from those pairings, the relationships of distinctive feature patterns will tend to be fairly obvious.
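For the segment-letter approach, a plain Levenshtein distance over characters is enough to get started. This is a generic textbook sketch, not anything specific to your data. The sub name edit_distance is my own, and there are CPAN modules that do the same job:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein distance, counted in characters (code points), so a
# single-symbol affricate costs the same as any other segment.
sub edit_distance {
    my ( $s, $t ) = @_;
    my @s = split //, $s;
    my @t = split //, $t;

    # Row 0: distance from the empty prefix of $s to each prefix of $t.
    my @prev = ( 0 .. scalar(@t) );

    for my $i ( 1 .. scalar(@s) ) {
        my @cur = ($i);
        for my $j ( 1 .. scalar(@t) ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # deletion
                $cur[ $j - 1 ] + 1,         # insertion
                $prev[ $j - 1 ] + $cost,    # match or substitution
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# After transliteration, ts vs. dz is one substitution instead of two.
my $single = edit_distance( "\x{02a6}a", "\x{02a3}a" );
my $double = edit_distance( 'tsa',       'dza' );
```

That difference between $single and $double is the reason for doing the digraph substitutions before computing distances.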


In reply to Re: Reading IPA characters in Perl (Unicode?) by graff
in thread Reading IPA characters in Perl (Unicode?) by Anonymous Monk
