killedar has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a transliterator for converting roman into another language. It is much easier to type using a roman keyboard, but the output will show in the appropriate language font. For example: dny = character #225; kh = character #35; k = character #12; h = character #10; a = character #8; kha = characters #35#8 and not #12#10#8.

What pattern should I write to split a string into those characters? For example, khatos should split into kh, a, t, o, s. So triple-character patterns should match first, then double-character ones, then single characters. It is guaranteed that there will always be some vowels in between, but they can be multi-character too. For example, the word mukharjee should split into m, u, kh, a, rj, ee.

Or is it possible to get the first characters that are not vowels, then the vowels, then the non-vowel characters again? Once I have split it, all I need to do is find the associated character in an assoc array and print it. Thanks for your help.

Replies are listed 'Best First'.
RE: Regexp and transliteration between languages
by gnat (Beadle) on Jun 15, 2000 at 19:38 UTC
    This will give you the chunks of a string:
    @tokens = sort { length $b <=> length $a } qw(dny kh k h);
    $re     = join "|", @tokens;
    @chunks = split /($re)/, $text;
    
    I'm building a regular expression that matches what you're looking for. I use the fact that Perl's RE engine tries alternated choices ("food|foo") from left to right, so I put the longest bits first. While the split() function ordinarily returns only the bits of the string that don't match the regular expression, parenthesizing portions of the regexp makes it return those portions as well.
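    To see what split() actually hands back, here is a minimal sketch using the tokens from the question. The quotemeta call and the grep are my additions, not part of the snippet above: quotemeta only matters if a token ever contains a regex metacharacter, and grep drops the empty strings that split() produces when a match sits at the start of the string or when two matches are adjacent.

```perl
use strict;
use warnings;

# Multi-character tokens from the question, longest first so that
# "kh" is tried before "k" and "h".
my @tokens = sort { length $b <=> length $a } qw(dny kh k h);
my $re     = join '|', map { quotemeta($_) } @tokens;

# The capturing parentheses make split return the matched tokens too;
# grep removes the empty fields between or before matches.
my @chunks = grep { length } split /($re)/, 'khatos';

print join(',', @chunks), "\n";   # prints "kh,atos"
```

    Note that "atos" comes back as one chunk: text that matches none of the tokens is not broken up any further, so single letters such as a, t, o and s would have to be listed as tokens as well if each should become its own chunk.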

    If you have the correspondences, though, you can do a transliterator quite easily:

    %changes = ( dny => 225,
                 kh  => 35,
                 k   => 12, # ...
    );
    $re = join "|", sort { length $b <=> length $a }
                    keys %changes;
    $text =~ s/($re)/chr $changes{$1}/ge;
    
    I build a regular expression from the keys of the hash, longest to shortest. Then I search and replace on the string. Everywhere I find something from the hash, I replace it with the character whose code is in the hash. The /e flag says the second portion of the s/// is Perl code that generates the replacement string.
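    Filled in with the remaining codes from the question, the same idea might look like the sketch below. The transliterate sub name and the use of quotemeta are assumptions of mine for illustration; the character codes are the ones quoted in the question.

```perl
use strict;
use warnings;

# Roman sequence => character code, as listed in the question.
my %changes = (
    dny => 225,
    kh  => 35,
    k   => 12,
    h   => 10,
    a   => 8,
);

# Longest keys first, so "kh" wins over "k" followed by "h".
my $re = join '|',
         map  { quotemeta($_) }
         sort { length $b <=> length $a }
         keys %changes;

# Hypothetical helper: replace every known sequence with its character.
sub transliterate {
    my ($text) = @_;
    $text =~ s/($re)/chr $changes{$1}/ge;
    return $text;
}

# Show the resulting character codes rather than the raw characters.
my $out = transliterate('kha');
print join(' ', map { ord($_) } split //, $out), "\n";   # prints "35 8"
```

    Anything in the input that is not a key of the hash is left untouched, so missing table entries show up as stray roman letters in the output.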

Re: Regexp and transliteration between languages
by perlmonkey (Hermit) on Jun 15, 2000 at 10:47 UTC
    First, I know nothing about roman or the intricacies of the language so sorry if I totally miss what you are trying to do. From what it seems you want to do, this will work:
    $chars = 'dny|kh|rj|ee|\w';

    $foo = 'khatos';
    push(@a, $1) while ($foo =~ /($chars)/g);
    print join(' ', @a), "\n";

    $foo = 'mukharjee';
    push(@b, $1) while ($foo =~ /($chars)/g);
    print join(' ', @b), "\n";
    Results:
    kh a t o s
    m u kh a rj ee

    There may be a more elegant way though. The groupings in $chars are all the character combinations that should be considered a single term. $chars will obviously have to be expanded to include all the possible groups for the language; just keep the 3-character groups at the front, then the 2-character groups, with \w last to catch any remaining single character.
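    As the list of groups grows, the ordering can be left to sort instead of being maintained by hand. A rough sketch, assuming the groups live in an array (the @groups and tokenize names are made up for this example):

```perl
use strict;
use warnings;

# Multi-character groups in any order; sort puts the longest first.
my @groups = qw(kh ee dny rj);
my $chars  = join '|',
             (map { quotemeta($_) } sort { length $b <=> length $a } @groups),
             '\w';   # fallback: any single word character

# Hypothetical helper: a global match in list context returns every token.
sub tokenize {
    my ($word) = @_;
    my @tokens = $word =~ /($chars)/g;
    return @tokens;
}

print join(' ', tokenize('khatos')),    "\n";   # prints "kh a t o s"
print join(' ', tokenize('mukharjee')), "\n";   # prints "m u kh a rj ee"
```

    The list-context match does the same job as the push ... while loop; which one reads better is a matter of taste.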

      Roman isn't a language it's the alphabet. You may not know much about it but you used it to write your message :-)

      What is being written here is a tool that lets you type in another language using the standard "english" keyboard and converts groups of "english" (or more correctly roman) characters into the characters needed for the other language (which aren't on your keyboard).

      Nice solution BTW

      Nuance

        Roman isn't a language it's the alphabet.

        Doh! That was dumb, I knew that. Thanks for reminding me though.

        gnat's solution has the elegance I was looking for, nice.
Re: Regexp and transliteration between languages
by killedar (Initiate) on Jun 16, 2000 at 00:05 UTC
    Many thanks for all the answers. This helps a lot. I will try these expressions. The problem is getting more interesting as I look deeper. Intuitively, I have a feeling that we can make better use of vowel patterns, but I don't know how.

    It is guaranteed that all the vowels in the language will be made from combinations of the English vowels aeiou. For example: mukharjee -> m (non-vowel) + u (vowel) + kh (non-vowel) + a (vowel) + rj (non-vowel) + ee (vowel). I feel now that if I can make use of this vowel/non-vowel pattern, I won't need to list every combination of characters; they will take care of themselves. For example, "k" will always be followed by a vowel (a combination of aeiou), and so will "kh", so splitting on the vowels takes care of whether it is "k", "kh" or, say, "khx".

    So what may be needed is: get all characters until one of aeiou is found, then get all characters until a non-aeiou character is found (effectively getting the vowel), and so on. Any suggestions?
      This is an ugly, half baked idea, but you might do something like this:
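      while ($word) {
          # Split off the leading consonants and the first vowel run; the rest
          # goes back into $word. The + keeps a run such as "ee" together.
          ($consonants, $vowel, $word) = split(/([aeiou]+)/, $word, 2);
          # do something here
      }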
      I really like the parenthesis collection feature in split.
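      Another way to get the alternating vowel/non-vowel pieces is a single global match that grabs either a run of vowels or a run of non-vowels. This is only a sketch under the assumptions stated in the post above: lowercase input, vowels built solely from aeiou, and every consonant run (such as "rj") being one unit in the lookup table. The vowel_runs name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split a word into alternating runs of
# vowels (aeiou) and non-vowels, in the order they appear.
sub vowel_runs {
    my ($word) = @_;
    my @runs = $word =~ /([aeiou]+|[^aeiou]+)/g;
    return @runs;
}

print join(',', vowel_runs('mukharjee')), "\n";   # prints "m,u,kh,a,rj,ee"
print join(',', vowel_runs('khatos')),    "\n";   # prints "kh,a,t,o,s"
```

      Each run can then be looked up in the hash of character codes. A consonant run that is really two separate characters would still need the longest-first alternation to break it apart.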