Hi Monks!

I apologise if my question is complicated or unclear; I'm very new to this! I've been looking into this problem for a while, and I've now reached a point in my work where getting this to run correctly would save me a lot of time. I'm a linguist, and I often encounter text files that are in the language I study but in the wrong transcription or orthography. I understand that Perl can help me convert them to another character set. After studying several transliteration scripts I found online, I ended up with this, and it works very well:

#!/usr/bin/perl
use strict;
use warnings;

my %trans = (
    "t'i" => '&#1090;&#1080;',
    "t'a" => '&#1090;&#1103;',
    "t'u" => '&#1090;&#1102;',
    "t'e" => '&#1090;&#1077;',
    # It continues like this for several hundred lines; this is just a snippet.
    # It simply goes through all character combinations and turns the text into Cyrillic.
);

# Actual translation logic: longest keys first, so "t'a" is tried before shorter keys
my @signs = sort { length($b) <=> length($a) } keys %trans;
@signs = map { quotemeta } @signs;
my $re = join '|', @signs, '.';

# Read input from STDIN, one line at a time
while (<STDIN>) {
    my $input = $_;
    $input =~ s/($re)/exists($trans{$1}) ? $trans{$1} : $1/ge;
    print $input;
}

It does its job well and converts text like "menö šuöny niko" to "менӧ шуӧны нико".
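To show why the keys are sorted by length before the regex is built, here is a minimal, self-contained sketch with a made-up four-entry table. The names %demo and transliterate exist only for this example, and the real table has several hundred entries. Without the sort, a short key such as "t" could match before "t'a" gets a chance.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # this source contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical mini table, for illustration only
my %demo = (
    "t'a" => 'тя',
    "t"   => 'т',
    "a"   => 'а',
    "š"   => 'ш',
);

# Longest keys first, so "t'a" wins over "t" followed by "a"
my $alt = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } keys %demo;
my $re = qr/($alt)/;

sub transliterate {
    my ($text) = @_;
    $text =~ s/$re/$demo{$1}/g;            # characters not in the table are left alone
    return $text;
}

print transliterate("t'a ta š"), "\n";
```

With this table the script prints "тя та ш": the first "t'a" is taken as one unit, while the plain "ta" is converted letter by letter.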

However, I often have the old transcription inside an XML file. These files are created in a program called ELAN, and they basically have a structure like this:

<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
    <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
        <ANNOTATION>
            <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
                <ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>
            </REF_ANNOTATION>
        </ANNOTATION>
        <ANNOTATION>
            <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
                <ANNOTATION_VALUE>at't'ö perl manastyrly!</ANNOTATION_VALUE>
            </REF_ANNOTATION>
        </ANNOTATION>
    </TIER>
</ANNOTATION_DOCUMENT>

So I would like to run the transliteration script on the text "menö šuöny niko" inside this element:

<ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>

However, this would need to happen only in the nodes inside the structure:

<TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
</TIER>

So the final result would be like:

<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
    <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
        <ANNOTATION>
            <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
                <ANNOTATION_VALUE>&#1084;&#1077;&#1085;&#1255; &#1096;&#1091;&#1255;&#1085;&#1099; &#1085;&#1080;&#1082;&#1086;</ANNOTATION_VALUE>
            </REF_ANNOTATION>
        </ANNOTATION>
        <ANNOTATION>
            <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
                <ANNOTATION_VALUE>&#1072;&#1090;&#1090;&#1100;&#1255; &#1087;&#1077;&#1088;&#1083; &#1084;&#1072;&#1085;&#1072;&#1089;&#1090;&#1099;&#1088;&#1083;&#1099;!</ANNOTATION_VALUE>
            </REF_ANNOTATION>
        </ANNOTATION>
    </TIER>
</ANNOTATION_DOCUMENT>

The change should happen only here, as there are other tiers with different data that should remain as they are.

Also, if you think there is something specific I should read about this, I'm ready to do that. I honestly want to learn Perl. I didn't know whether it is OK to post really long pieces of code, so I took only these small pieces that illustrate what I'm doing. I guess I would need to select the right XML node with XPath or something similar, but I have no idea where to put this into the Perl script! I've been learning about Perl and XML over the last few months, but I'm still taking very early steps.
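To make the XPath idea concrete, this is roughly the shape I imagine, pieced together from the XML::LibXML documentation. Please treat it as a sketch under assumptions rather than something I know to be right: it assumes the XML::LibXML module from CPAN is installed, the four-entry %trans table stands in for my real one (and uses literal Cyrillic instead of entities), and the second tier type "otherT" is invented just to show a tier that must stay untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # literal UTF-8 in this source file
use Encode qw(encode);
use XML::LibXML;                 # assumption: installed from CPAN, not core Perl

# Stand-in table; the real one has several hundred entries
my %trans = ( 'm' => 'м', 'e' => 'е', 'n' => 'н', 'ö' => 'ӧ' );
my $alt = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %trans;
my $re  = qr/($alt)/;

sub transliterate {
    my ($text) = @_;
    $text =~ s/$re/$trans{$1}/g;     # characters not in the table stay as they are
    return $text;
}

# Takes the XML as UTF-8 bytes, returns the converted XML as UTF-8 bytes
sub convert_orth_tiers {
    my ($xml_bytes) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_bytes);
    # Only ANNOTATION_VALUE nodes below a TIER whose type is "orthT"
    my $xpath = '//TIER[@LINGUISTIC_TYPE_REF="orthT"]//ANNOTATION_VALUE';
    for my $node ($doc->findnodes($xpath)) {
        my $new = transliterate($node->textContent);
        $node->removeChildNodes;     # drop the old text ...
        $node->appendText($new);     # ... and put the converted text in its place
    }
    return $doc->toString;
}

# Tiny sample document; "otherT" is a made-up tier type for the demo
my $sample = encode('UTF-8', <<'XML');
<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="orthT" TIER_ID="orth@S1">
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a1" ANNOTATION_REF="a2">
      <ANNOTATION_VALUE>menö</ANNOTATION_VALUE>
    </REF_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER LINGUISTIC_TYPE_REF="otherT" TIER_ID="other@S1">
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a2">
      <ANNOTATION_VALUE>menö</ANNOTATION_VALUE>
    </REF_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
XML

print convert_orth_tiers($sample);
```

For a real ELAN file I would presumably load it with load_xml(location => ...) and write the result with the document's toFile method instead of using a string, but I have not tried that on my data yet.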

Thank you for all the help!


In reply to Transliteration inside an XML file by nikop
