This will give you the chunks of a string:
@tokens = sort { length $b <=> length $a } qw(dny kh k h);
$re = join "|", @tokens;
@chunks = split /($re)/, $text;
I'm building a regular expression that matches what you're
looking for. I use the fact that Perl's RE engine tries
alternated choices ("food|foo") from left to right, so I
put the longest bits first. The split() function
ordinarily returns only the bits of the string that don't
match the regular expression, but if you parenthesize portions
of the regexp, split() returns those portions as well.
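One thing to watch for: when two tokens sit next to each other, split() also returns the empty string between them. Here is a minimal sketch of filtering those out. The input word is a made-up example, and the quotemeta call is an added precaution that only matters if a token ever contains a regex metacharacter.

```perl
use strict;
use warnings;

# Hypothetical sample input; the token list is the one from the post.
my $text   = 'khadnyk';
my @tokens = sort { length($b) <=> length($a) } qw(dny kh k h);
my $re     = join '|', map { quotemeta($_) } @tokens;

# Capturing parentheses make split return the matched tokens too.
my @raw = split /($re)/, $text;

# Adjacent tokens (and a token at the very start) leave empty strings
# in the list, so drop anything of zero length.
my @chunks = grep { length } @raw;

print join(' ', @chunks), "\n";   # kh a dny k
```

Note that text between tokens ("a" here) comes back as one unsplit piece, so longer runs of non-token characters stay together.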
If you have the correspondences, though, you can do a
transliterator quite easily:
%changes = ( dny => 225,
             kh  => 35,
             k   => 12, # ...
           );
$re = join "|", sort { length $b <=> length $a }
                keys %changes;
$text =~ s/($re)/chr $changes{$1}/ge;
I build a regular expression from the keys of the hash,
longest to shortest. Then I search and replace on the
string. Everywhere I find something from the hash, I
replace it with the character whose code is in the hash.
The /e flag says the second portion of the
s/// is Perl code that generates the replacement string.
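To see why the longest-first ordering matters, here is a small sketch that runs the same substitution with the keys sorted both ways and prints the resulting character codes. The codes are the ones from the hash above; the input word and the transliterate() helper are made up for illustration.

```perl
use strict;
use warnings;

# Codes copied from the post; the input word is a made-up example.
my %changes = ( dny => 225, kh => 35, k => 12 );
my $text    = 'khdnyk';

# Hypothetical helper: build an alternation from the keys in the order
# given, apply it, and return the character codes of the result.
sub transliterate {
    my ($input, @keys) = @_;
    my $re = join '|', map { quotemeta($_) } @keys;
    (my $out = $input) =~ s/($re)/chr $changes{$1}/ge;
    return join ' ', map { ord($_) } split //, $out;
}

my $longest_first  = transliterate($text,
    sort { length($b) <=> length($a) } keys %changes);
my $shortest_first = transliterate($text,
    sort { length($a) <=> length($b) } keys %changes);

print "longest first:  $longest_first\n";    # 35 225 12
print "shortest first: $shortest_first\n";   # 12 104 225 12
```

With the short keys first, "k" wins at the start of "kh" and the "h" is left behind untranslated (code 104), which is exactly what the sort is there to prevent.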
| [reply] |
First, I know nothing about Roman or the intricacies of the
language, so sorry if I totally miss what you are trying to do.
From what you seem to want, this will work:
$chars = 'dny|kh|rj|ee|\w';
$foo = 'khatos';
push(@a, $1) while ($foo=~/($chars)/g);
print join(' ', @a), "\n";
$foo = 'mukharjee';
push(@b, $1) while ($foo=~/($chars)/g);
print join(' ', @b) , "\n";
Results:
kh a t o s
m u kh a rj ee
There may be a more elegant way though.
The groupings in $chars are all the character combinations
that should be considered a single term. $chars will obviously
have to be expanded to include all the possible groups
for the language. Just keep the 3-character groups at the front,
followed by the 2-character groups, with \w last.
| [reply] [d/l] |
Roman isn't a language it's the alphabet. You may not know much about it but you used it to write your message :-)
What is being written here is a tool that lets you type in another language using the standard "English" keyboard and converts groups of "English" (or, more correctly, Roman) characters into the characters needed for the other language (which aren't on your keyboard).
Nice solution BTW
Nuance
| [reply] |
Roman isn't a language it's the alphabet.
Doh! That was dumb, I knew that. Thanks for reminding me though.
gnat's solution has the elegance I was looking for, nice.
| [reply] |
Many thanks for all the answers. This helps a lot. I will try these expressions. The problem gets more interesting the deeper I look. Intuitively, I have a feeling that we can make better use of vowel patterns, but I don't know how. It is guaranteed that every vowel in the language is made from a combination of the English vowels aeiou. For example:
mukharjee -> m (non-vowel) + u (vowel) + kh (non-vowel) + a (vowel) + rj (non-vowel) + ee (vowel)
I feel now that if I can use this vowel/non-vowel pattern,
I won't need the full list of character combinations; they will take care of themselves. For example, "k" will always be followed by a vowel (a combination of aeiou), and so will "kh", so splitting on vowels handles "k", "kh", or, say, "khx" alike.
So what may be needed is: take all characters until one of aeiou is found (a non-vowel group), then take all characters until a non-aeiou character is found (a vowel group), and so on.
Any suggestions?
| [reply] |
This is an ugly, half-baked idea, but you might do something like this:
while ($word) {
    # Split at the first vowel: text before it, the vowel itself, the rest.
    ($consonants, $vowel, $word) = split(/([aeiou])/, $word, 2);
    # do something here
}
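If the goal is the vowel/non-vowel grouping described in the question, a global match over alternating runs might be a simpler route. This is only a sketch, assuming lowercase ASCII input in which a, e, i, o and u are the only vowels; the variable names are made up.

```perl
use strict;
use warnings;

# Assumption: lowercase ASCII input, with only a, e, i, o, u as vowels.
my $word = 'mukharjee';

# In list context, /g returns every run of vowels or of non-vowels in order.
my @parts = $word =~ /([aeiou]+|[^aeiou]+)/g;

print join(' ', @parts), "\n";   # m u kh a rj ee
```

This keeps any run of consonants together as one unit, so it only works if two separate consonant groups never appear back to back. If they can, the explicit token list is still needed to break them apart.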
I really like the parenthesis collection feature in split.
| [reply] [d/l] |