I don't know enough about translation in general or XML extraction and rendering of Cyrillic text to comment on those aspects of your post, but certain details of your basic approach caught my eye.

First, do yourself a favor and always use warnings and strictures (see strict), and use lexical variables whenever possible.
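A minimal sketch of what that preamble looks like in practice. The one-entry table and the variable names here are only for illustration and are not taken from your script:

```perl
use strict;
use warnings;

# Lexical (my) variables: visible only in the enclosing scope.
my %trans = ( q{t'i} => '&#1090;&#1080;' );
my $key   = q{t'i};

print "$key => $trans{$key}\n";

# With strict in force, a misspelled name such as $trasn{$key} is a
# compile-time error instead of a silent lookup in an empty hash.
```

Without strictures, that kind of typo simply yields undef and the bug surfaces much later, if at all.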

Second, the form of the basic matching regex in the OP's code seems inefficient. Appending a . (dot) metacharacter as the final alternative means the regex will match and capture each and every character (except a newline). The substitution expression compensates by replacing each character that has no valid translation with the character just captured, for a net change of zero.
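To make the cost concrete, here is a rough reconstruction of the catch-all form. It is my guess at the general shape, not a copy of the OPed code; the two-entry table and the $count counter are assumptions added purely for illustration:

```perl
use strict;
use warnings;

my %trans = (
    q{t'i} => '&#1090;&#1080;',
    q{t'a} => '&#1090;&#1103;',
);

# Longest keys first, then a trailing '.' that matches any other character.
my $re = join '|',
    ( map quotemeta, sort { length($b) <=> length($a) } keys %trans ),
    '.';

my $text  = "t'i don't understand t'any cyrillic.";
my $count = 0;

# /e is required here because the replacement has to choose between the
# table entry and the character that was just captured.
( my $out = $text ) =~ s{($re)}{ $count++; exists $trans{$1} ? $trans{$1} : $1 }ge;

print "substitutions performed: $count\n";
print "result: [$out]\n";
```

Only two of those substitutions change anything; every other one swaps a character for itself.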

It would seem more efficient to capture and replace only those character sequences that need translation. This also means the replacement needs no /e evaluation.

    use warnings;
    use strict;

    my %trans = (
        q{t'i} => '&#1090;&#1080;',
        q{t'a} => '&#1090;&#1103;',
        q{t'u} => '&#1090;&#1102;',
        q{t'e} => '&#1090;&#1077;',
    );

    # Original approach: longest keys first, plus a catch-all '.' alternative.
    my @signs = sort { length($b) <=> length($a) } keys %trans;
    @signs = map quotemeta($_), @signs;
    my $re = join '|', @signs, '.';
    print "original: '$re' \n";

    # Suggested approach: match only the sequences that have a translation.
    my ($cyril) =
        map qr{ $_ }xms,
        join ' | ',
        map quotemeta,
        sort { length($b) <=> length($a) }
        keys %trans
        ;
    print "suggested: $cyril \n";

    my $text = "t'i don't understand t'any cyrillic.";
    print "raw: [$text] \n";

    # Every match is a key of %trans, so a plain hash lookup suffices (no /e).
    $text =~ s{ ($cyril) }{$trans{$1}}xmsgo;
    print "translation: [$text] \n";

Output:

    c:\@Work\Perl\monks\nikop>perl xlate_cyrillic_1.pl
    original: 't\'e|t\'i|t\'u|t\'a|.'
    suggested: (?^msx: t\'e | t\'i | t\'u | t\'a )
    raw: [t'i don't understand t'any cyrillic.]
    translation: [&#1090;&#1080; don't understand &#1090;&#1103;ny cyrillic.]

Further reading on regexes: perlre, perlrequick, and especially perlretut. See also the tutorials under Pattern Matching, Regular Expressions, and Parsing.


In reply to Re: Transliteration inside an XML file by AnomalousMonk
in thread Transliteration inside an XML file by nikop
