Regexp and transliteration between languages

killedar has asked for the wisdom of the Perl Monks concerning the following question:

RE: Regexp and transliteration between languages by gnat (Beadle) on Jun 15, 2000 at 19:38 UTC
This will give you the chunks of a string:

`@tokens = sort { length $b <=> length $a } qw(dny kh k h);
$re = join "|", @tokens;
@chunks = split /($re)/, $text;`

I'm building a regular expression that matches what you're looking for. I use the fact that Perl's RE engine tries alternated choices ("food|foo") from left to right, so I put the longest bits first. While the split() function ordinarily returns only the bits of the string that don't match the regular expression, by parenthesizing portions of the regexp, those portions are also returned.

If you have the correspondences, though, you can do a transliterator quite easily:

`%changes = (
    dny => 225,
    kh  => 35,
    k   => 12,
    # ...
);
$re = join "|", sort { length $b <=> length $a } keys %changes;
$text =~ s/($re)/chr $changes{$1}/ge;`

I build a regular expression from the keys of the hash, longest to shortest. Then I search and replace on the string. Everywhere I find something from the hash, I replace it with the character whose code is in the hash. The /e flag says the second portion of the s/// is Perl code that generates the replacement string.
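A minimal, self-contained sketch of the two snippets above, for anyone who wants to run them. The code points in %changes are placeholders (the thread never gives the real mapping), and the sub names, the quotemeta call and the grep that drops empty chunks are additions that are not in the original post.

```perl
use strict;
use warnings;

# Placeholder mapping: the code points are illustrative, not a real table.
my %changes = (dny => 225, kh => 35, k => 12);

# Longest keys first, so "kh" is tried before "k". quotemeta (an addition)
# guards against keys that contain regex metacharacters.
my $re = join "|",
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %changes;

# Tokenize: the capturing parentheses make split() return the matched keys too.
# grep drops the empty strings split() yields at the start or between
# adjacent matches.
sub chunks {
    my ($text) = @_;
    return grep { length } split /($re)/, $text;
}

# Transliterate: replace every key with the character whose code is in the hash.
sub transliterate {
    my ($text) = @_;
    $text =~ s/($re)/chr $changes{$1}/ge;
    return $text;
}

print join(" ", chunks("khak")), "\n";   # prints "kh a k"
```

Text that matches no key ("a" here) comes back from chunks() as its own piece and is left untouched by transliterate().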
Re: Regexp and transliteration between languages by perlmonkey (Hermit) on Jun 15, 2000 at 10:47 UTC
First, I know nothing about Roman or the intricacies of the language, so sorry if I totally miss what you are trying to do. From what it seems you want to do, this will work:

`$chars = 'dny|kh|rj|ee|\w';
$foo = 'khatos';
push(@a, $1) while ($foo =~ /($chars)/g);
print join(' ', @a), "\n";
$foo = 'mukharjee';
push(@b, $1) while ($foo =~ /($chars)/g);
print join(' ', @b), "\n";`

Results:

kh a t o s
m u kh a rj ee

There may be a more elegant way, though. The groupings in $chars are all the character combinations that should be considered a single term. $chars will obviously have to be expanded to include all the possible groups for the language; just keep the 3-character groups at the front, then the 2-character groups, with \w last as the single-character fallback.
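As a hedged variation on the loop above: in list context a /g match returns all of its captures at once, so the push/while pair can be folded into one expression. The sub name tokens is made up for this sketch.

```perl
use strict;
use warnings;

# Multi-character groups first, then \w as the single-character fallback.
my $chars = 'dny|kh|rj|ee|\w';

# In list context, m//g returns every captured match in one go,
# so no explicit push/while loop is needed.
sub tokens {
    my ($word) = @_;
    return $word =~ /($chars)/g;
}

print join(' ', tokens($_)), "\n" for qw(khatos mukharjee);
```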
RE: Re: Regexp and transliteration between languages by nuance (Hermit) on Jun 15, 2000 at 17:13 UTC
Roman isn't a language, it's the alphabet. You may not know much about it, but you used it to write your message :-)

What is being written here is a tool that lets you type in another language using the standard "English" keyboard, converting groups of "English" (or, more correctly, Roman) characters into the characters needed for the other language (which aren't on your keyboard).

Nice solution, BTW.

*Nuance*
RE: RE: Re: Regexp and transliteration between languages by perlmonkey (Hermit) on Jun 15, 2000 at 21:14 UTC
"Roman isn't a language, it's the alphabet."

Doh! That was dumb, I knew that. Thanks for reminding me though. gnat's solution has the elegance I was looking for, nice.
Re: Regexp and transliteration between languages by killedar (Initiate) on Jun 16, 2000 at 00:05 UTC
Many thanks for all the answers. This helps a lot. I will try these expressions.

The problem is getting more interesting as I look deeper. Intuitively, I have a feeling that we can make better use of vowel patterns, but I don't know how. It is guaranteed that all the vowels in the language will be made from combinations of the English vowels aeiou. For example:

mukharjee -> m (non-vowel) + u (vowel) + kh (non-vowel) + a (vowel) + rj (non-vowel) + ee (vowel)

I feel now that if I can make use of this vowel/non-vowel pattern, I won't need to list every character combination; they will take care of themselves. For example, "k" will always be followed by a vowel (a combination of aeiou), and so will "kh", so splitting on the vowels takes care of whether it is "k", "kh" or, say, "khx".

So what may be needed is: take all characters until one of aeiou is found (giving the non-vowel), then take all characters until a non-aeiou character is found (giving the vowel), and so on. Any suggestions?
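One way to sketch the vowel/non-vowel idea described above is a single alternation of two character classes. This assumes lowercase input made only of the letters a-z; the sub name runs is made up for the example.

```perl
use strict;
use warnings;

# Split a word into alternating runs of vowels and non-vowels.
# Assumption: lowercase a-z input; anything that is not aeiou counts
# as a non-vowel, including digits or punctuation if they appear.
sub runs {
    my ($word) = @_;
    return $word =~ /([aeiou]+|[^aeiou]+)/g;
}

print join(' + ', runs('mukharjee')), "\n";   # m + u + kh + a + rj + ee
```

Note that this treats any unbroken stretch of consonants as one unit, so it only works if two separate consonant sounds never sit next to each other without a vowel between them.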
RE: Re: Regexp and transliteration between languages by chromatic (Archbishop) on Jun 16, 2000 at 04:44 UTC
This is an ugly, half-baked idea, but you might do something like this:

`while ($word) {
    ($consonants, $vowel, $word) = split(/([aeiou])/, $word, 2);
    # do something here
}`

I really like the parenthesis collection feature in split.
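A runnable sketch of that loop, with two changes that are assumptions rather than part of the post: the pattern is [aeiou]+ so that a vowel group such as "ee" stays together, and the loop tests defined/length so it ends cleanly when the word finishes on a consonant. The sub name syllable_pairs is hypothetical.

```perl
use strict;
use warnings;

# Peel off one consonant run and one vowel run per pass.
sub syllable_pairs {
    my ($word) = @_;
    my @pairs;
    while (defined $word && length $word) {
        # With a limit of 2 and a capturing group, split returns
        # (text before the vowels, the vowels, the rest of the word).
        (my $consonants, my $vowels, $word) = split /([aeiou]+)/, $word, 2;
        push @pairs, [ $consonants, defined $vowels ? $vowels : '' ];
    }
    return @pairs;
}

print join(' ', map { "[$_->[0]|$_->[1]]" } syllable_pairs('mukharjee')), "\n";
# prints "[m|u] [kh|a] [rj|ee]"
```

Each pair could then be looked up in a transliteration table in place of the "do something here" comment.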