in reply to Transliteration inside an XML file

I don't know enough about translation in general or XML extraction and rendering of Cyrillic text to comment on those aspects of your post, but certain details of your basic approach caught my eye.

First: Do yourself a favor and always use warnings and strictures (see strict), and lexical variables whenever possible.
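As a small illustration of why strictures pay off, here is a sketch in which a hash name is misspelled. The variable names and the one-entry table are made up for the example. The string eval is there only so the script can report the compile-time error instead of dying on it.

```perl
use strict;
use warnings;

my %trans = ( q{t'i} => '&#1090;&#1080;' );    # lexical, declared with my

# The string eval compiles a snippet that misspells %trans as %tran.
# Under strict that is a compile-time error, which ends up in $@.
my $ok  = eval q{ my $count = keys %tran; 1 };
my $err = $ok ? '' : $@;
print "strict caught the typo: $err" if $err;
```

Without strict, the misspelled name would quietly refer to an empty package variable and the mistake would surface much later, if at all.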

Second, the form of the basic matching regex in the OPed code seems inefficient. Appending a . (dot) metacharacter to the alternation means the regex will match and capture each and every character (except a newline), not just the sequences in the translation table. The substitution expression compensates by replacing each character that has no valid translation with the character just captured, a net change of zero.
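To make that concrete, here is a minimal sketch of the catch-all form. It is my reconstruction, not the OP's actual code: the two-entry table, the sub name xlate_catch_all, the counter, and the exists test in the replacement are all assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical two-entry table, for illustration only.
my %trans = (
    q{t'i} => '&#1090;&#1080;',
    q{t'a} => '&#1090;&#1103;',
);

# Known sequences, longest first, plus a trailing catch-all dot.
my $re = join '|',
    ( map { quotemeta($_) } sort { length($b) <=> length($a) } keys %trans ),
    '.';

sub xlate_catch_all {
    my ($text) = @_;
    my $attempts = 0;    # how many times the replacement code runs
    $text =~ s{($re)}{ $attempts++; exists $trans{$1} ? $trans{$1} : $1 }ge;
    return ( $text, $attempts );
}

my ( $out, $n ) = xlate_catch_all("t'i don't");
print "$out ($n replacements evaluated)\n";
```

For the input "t'i don't" the replacement code runs seven times, but only one of those runs changes anything. The other six put back the character that was just captured.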

It would seem more efficient to capture and replace only those character sequences needing translation. This also means the replacement can be a plain interpolated hash lookup, so the /e modifier is no longer needed.

use warnings;
use strict;

my %trans = (
    q{t'i} => '&#1090;&#1080;',
    q{t'a} => '&#1090;&#1103;',
    q{t'u} => '&#1090;&#1102;',
    q{t'e} => '&#1090;&#1077;',
);

my @signs = sort { length($b) <=> length($a) } keys %trans;
@signs = map quotemeta($_), @signs;
my $re = join '|', @signs, '.';
print "original: '$re' \n";

my ($cyril) =
    map qr{ $_ }xms,
    join ' | ',
    map quotemeta,
    sort { length($b) <=> length($a) }
    keys %trans
    ;
print "suggested: $cyril \n";

my $text = "t'i don't understand t'any cyrillic.";
print "raw: [$text] \n";

$text =~ s{ ($cyril) }{$trans{$1}}xmsgo;
print "translation: [$text] \n";

Output:

c:\@Work\Perl\monks\nikop>perl xlate_cyrillic_1.pl
original: 't\'e|t\'i|t\'u|t\'a|.'
suggested: (?^msx: t\'e | t\'i | t\'u | t\'a )
raw: [t'i don't understand t'any cyrillic.]
translation: [&#1090;&#1080; don't understand &#1090;&#1103;ny cyrillic.]

Further reading re: regexes: perlre, perlrequick, and especially perlretut. Also see the Pattern Matching, Regular Expressions, and Parsing tutorials.

Re^2: Transliteration inside an XML file
by graff (Chancellor) on Jun 19, 2014 at 22:14 UTC
    It would seem more efficient to capture and replace only those character sequences needing translation.

    Well, no, actually - not in this case. The OP is transliterating from a "Romanized" (Latin-alphabet-based) "transcription" into Cyrillic. All characters in a given string will need to be replaced, because Cyrillic has its own dedicated "page" within the Unicode table. The incoming Latin characters (and diacritic marks) may come from the ASCII table or somewhere else, but when the transliteration is finished, every character will have been replaced.
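A minimal sketch of that situation, assuming a hypothetical four-entry table in which every Latin input unit has a Cyrillic target. The single-letter rows and the sub name xlate_all are assumptions for illustration, not the OP's real scheme.

```perl
use strict;
use warnings;

# Hypothetical table: the single-letter rows are assumptions.
my %trans = (
    q{t'a} => '&#1090;&#1103;',
    q{t}   => '&#1090;',
    q{a}   => '&#1072;',
    q{i}   => '&#1080;',
);

# Longest keys first, so t'a wins over plain t.
my $alt   = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } keys %trans;
my $cyril = qr{$alt};

sub xlate_all {
    my ($text) = @_;
    ( my $out = $text ) =~ s{($cyril)}{$trans{$1}}g;
    return $out;
}

print xlate_all("tat'a"), "\n";
```

With a table like this, every character of a well-formed input belongs to some match, so there are no pass-through characters left for a catch-all dot to handle. Anything not in the table is simply left as it was.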