I don't know enough about translation in general or XML extraction and rendering of Cyrillic text to comment on those aspects of your post, but certain details of your basic approach caught my eye.
First: Do yourself a favor and always use warnings and strictures (see strict), and lexical variables whenever possible.
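As a minimal sketch of what strictures buy you: the snippet below compiles two small pieces of code under strict and reports whether each one compiled. The helper name check_snippet is made up for this illustration; it is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: compile and run a snippet under strict/warnings
# via string eval, and report whether it succeeded.
sub check_snippet {
    my ($code) = @_;
    my $ok = eval "use strict; use warnings; $code; 1";
    return $ok ? 'ok' : "error: $@";
}

# A lexical declared with my compiles cleanly.
print check_snippet('my $count = 1; $count++'), "\n";

# A misspelled (hence undeclared) variable is rejected at compile time
# instead of silently becoming a new global.
print check_snippet('my $count = 1; $cuont++'), "\n";
```

Without strict, the second snippet would run and quietly increment an unrelated global variable.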
Second, the form of the basic matching regex in the original post's code seems inefficient. Appending a . (dot) metacharacter as the final alternative means the regex will match and capture each and every character (except a newline). The substitution expression compensates by replacing every character that has no valid translation with the character just captured, a net change of zero.
It would seem more efficient to capture and replace only those character sequences needing translation. This also means the replacement needs no /e modifier, because it is a plain hash lookup rather than code to be evaluated:
    use warnings;
    use strict;

    my %trans = (
        q{t'i} => 'ти', q{t'a} => 'тя', q{t'u} => 'тю', q{t'e} => 'те',
        );

    my @signs = sort {length($b) <=> length($a)} keys %trans;
    @signs = map quotemeta($_), @signs;
    my $re = join '|', @signs, '.';
    print "original: '$re' \n";

    my ($cyril) =
        map qr{ $_ }xms,
        join ' | ',
        map quotemeta,
        sort { length($b) <=> length($a) }
        keys %trans
        ;
    print "suggested: $cyril \n";

    my $text = "t'i don't understand t'any cyrillic.";
    print "raw: [$text] \n";

    $text =~ s{ ($cyril) }{$trans{$1}}xmsgo;
    print "translation: [$text] \n";
Output:
    c:\@Work\Perl\monks\nikop>perl xlate_cyrillic_1.pl
    original: 't\'e|t\'i|t\'u|t\'a|.'
    suggested: (?^msx: t\'e | t\'i | t\'u | t\'a )
    raw: [t'i don't understand t'any cyrillic.]
    translation: [ти don't understand тяny cyrillic.]
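To put a rough number on the difference, here is a small sketch that runs both patterns over the same text and counts how many substitutions each performs. The helper names translate_with_dot and translate_keys_only are invented for this comparison, and the script assumes it is saved as UTF-8 (hence use utf8 and the output layer).

```perl
use strict;
use warnings;
use utf8;                              # assumption: source file is UTF-8
binmode STDOUT, ':encoding(UTF-8)';

my %trans = (
    q{t'i} => 'ти', q{t'a} => 'тя', q{t'u} => 'тю', q{t'e} => 'те',
);

# Longest keys first, so a longer sequence wins over a shorter prefix.
my $keys = join '|', map quotemeta, sort { length($b) <=> length($a) } keys %trans;

my $with_dot  = qr{$keys|.};   # the dot also matches every other character
my $keys_only = qr{$keys};     # matches only sequences that need translating

# Hypothetical helper: dot-terminated pattern, so /e is needed to fall
# back to the captured character when there is no translation.
sub translate_with_dot {
    my ($text) = @_;
    my $n = ($text =~ s{($with_dot)}{ exists $trans{$1} ? $trans{$1} : $1 }ge);
    return ($text, $n || 0);
}

# Hypothetical helper: keys-only pattern, plain hash lookup, no /e.
sub translate_keys_only {
    my ($text) = @_;
    my $n = ($text =~ s{($keys_only)}{$trans{$1}}g);
    return ($text, $n || 0);
}

my $raw = "t'i don't understand t'any cyrillic.";
for my $case ([ 'with dot', \&translate_with_dot ], [ 'keys only', \&translate_keys_only ]) {
    my ($label, $code) = @$case;
    my ($out, $count) = $code->($raw);
    print "$label: $count substitutions -> [$out]\n";
}
```

Both versions should produce the same translated string; only the number of substitutions performed differs.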
Further reading re: regexes: perlre, perlrequick, and especially perlretut. Also see the Pattern Matching, Regular Expressions, and Parsing tutorials.
In reply to Re: Transliteration inside an XML file
by AnomalousMonk
in thread Transliteration inside an XML file
by nikop