nikop has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks!
I apologise if my question is complicated or unclear; I'm very new to this! I've been looking into this problem for a while, and I've now reached a point in my work where getting this to run correctly would save me a lot of time. I'm a linguist, and I often come across text files that are in the language I study but in the wrong transcription or orthography. I understand that Perl can help me convert them to another character set, and after looking at models and hints from several transliteration scripts I found online, I ended up with this, which works very well:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # the table below contains UTF-8 literals
use open qw(:std :utf8);   # decode STDIN and encode STDOUT as UTF-8

my %trans = (
    "t'i" => 'ти',
    "t'a" => 'тя',
    "t'u" => 'тю',
    "t'e" => 'те',
    # It continues like this for several hundred lines; this is just a snippet.
    # It simply goes through all character combinations and turns the text into Cyrillic.
);

# Actual transliteration logic: try the longest keys first so that
# multi-character sequences win over their shorter prefixes.
my @signs = map { quotemeta } sort { length($b) <=> length($a) } keys %trans;
my $re = join '|', @signs, '.';   # '.' passes unmapped characters through

# Read input from STDIN, one line at a time.
while (my $input = <STDIN>) {
    $input =~ s/($re)/exists $trans{$1} ? $trans{$1} : $1/ge;
    print $input;
}
It does its job well and converts text like "menö šuöny niko" to "менӧ шуӧны нико".
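The sort by descending key length is what makes the multi-character keys work: a regex alternation takes the first alternative that matches, so `t'i` has to be tried before any shorter key that is a prefix of it. Here is a minimal sketch of that effect. Only `t'i` comes from the script above; the single-character entries are made up for illustration and may not match the real table.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Tiny illustrative table. Only "t'i" is taken from the script above;
# the single-character entries are hypothetical stand-ins.
my %trans = (
    "t'i" => 'ти',
    "t"   => 'т',
    "'"   => 'ь',
    "i"   => 'и',
);

# Build an alternation from the keys in the given order and apply it.
sub transliterate {
    my ($text, @keys) = @_;
    my $re = join '|', map { quotemeta } @keys;
    $text =~ s/($re)/$trans{$1}/g;
    return $text;
}

my @longest_first  = sort { length($b) <=> length($a) } keys %trans;
my @shortest_first = sort { length($a) <=> length($b) } keys %trans;

my $good = transliterate("t'i", @longest_first);    # "t'i" matched as one unit
my $bad  = transliterate("t'i", @shortest_first);   # matched piece by piece

binmode STDOUT, ':encoding(UTF-8)';
print "longest first:  $good\n";
print "shortest first: $bad\n";
```

With this toy table, the longest-first order yields "ти", while the shortest-first order yields "тьи" because the lone `t` wins at the first position.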
However, I often have the old transcription inside an XML file created in a program called ELAN. It basically has a structure like this:
<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
        <ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
        <ANNOTATION_VALUE>at't'ö perl manastyrly!</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
So I would like to run the transliteration script on the text "menö šuöny niko" inside this element:
<ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>
However, this should happen only in nodes inside this tier:
<TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1"> </TIER>
So the final result would look like this:
<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
        <ANNOTATION_VALUE>менӧ шуӧны нико</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
        <ANNOTATION_VALUE>аттьӧ перл манастырлы!</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
The change should be made only here, as there are other tiers with different data that should remain as they are.
Also, if you think there is something specific I should read more about, I'm ready to do that. I honestly want to learn Perl. I didn't know whether it is OK to post really long pieces of code, so I took just these small pieces that illustrate what I'm doing. I guess I would need to select the right XML node with XPath or something similar, but I have no idea where to put this in the Perl script! I've been learning about Perl and XML over the last few months, but I'm still taking very early steps.
Thank you for all the help!
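One possible way to combine the two pieces described above is to parse the file, select the `ANNOTATION_VALUE` elements with XPath, and run the substitution on their text only. The sketch below assumes the CPAN module XML::LibXML is installed. The table is a small stand-in whose single-letter entries are inferred from the sample output rather than taken from the real table, and the second tier (`otherT`, `a9001`) is invented to show that other tiers are left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use XML::LibXML;    # CPAN module, assumed to be installed

# Stand-in for the full table; single-letter entries are inferred
# from the sample output in the question, not from the real table.
my %trans = (
    "t'i" => 'ти',
    'm' => 'м', 'e' => 'е', 'n' => 'н', 'ö' => 'ӧ', 'š' => 'ш',
    'u' => 'у', 'y' => 'ы', 'i' => 'и', 'k' => 'к', 'o' => 'о',
);
my $re = join '|',
    map  { quotemeta }
    sort { length($b) <=> length($a) } keys %trans;

sub transliterate {
    my ($text) = @_;
    $text =~ s/($re)/$trans{$1}/g;    # unmapped characters stay as they are
    return $text;
}

# Inline sample; the "otherT" tier is hypothetical. A real script might
# use XML::LibXML->load_xml(location => $file) instead.
my $xml = <<'END_XML';
<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="orthT" TIER_ID="orth@S1">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
        <ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER LINGUISTIC_TYPE_REF="otherT" TIER_ID="other@S1">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a9001" ANNOTATION_REF="a2">
        <ANNOTATION_VALUE>menö šuöny niko</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
END_XML

# The parser is given bytes, matching the encoding declaration.
my $doc = XML::LibXML->load_xml(string => encode('UTF-8', $xml));

# Select annotation values only inside tiers of linguistic type "orthT".
for my $value ($doc->findnodes(
        '//TIER[@LINGUISTIC_TYPE_REF="orthT"]//ANNOTATION_VALUE')) {
    my $old = $value->textContent;    # decoded character string
    $value->removeChildNodes;
    $value->appendText(transliterate($old));
}

# toString on a document returns bytes in the document's encoding,
# so STDOUT needs no extra encoding layer here.
print $doc->toString;
```

The XPath predicate does the filtering, so the transliteration code itself does not need to know anything about tiers. Selecting on `TIER_ID` instead of `LINGUISTIC_TYPE_REF` would work the same way if a single tier is wanted.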
Re: Transliteration inside an XML file
by choroba (Cardinal) on Jun 19, 2014 at 11:17 UTC

Re: Transliteration inside an XML file
by AnomalousMonk (Archbishop) on Jun 19, 2014 at 10:59 UTC
by graff (Chancellor) on Jun 19, 2014 at 22:14 UTC

Re: Transliteration inside an XML file
by mirod (Canon) on Jun 19, 2014 at 14:06 UTC

Re: Transliteration inside an XML file
by flowdy (Scribe) on Jun 19, 2014 at 10:53 UTC

Re: Transliteration inside an XML file
by grondilu (Friar) on Jun 19, 2014 at 10:48 UTC
by flowdy (Scribe) on Jun 19, 2014 at 11:06 UTC

Re: Transliteration inside an XML file
by nikop (Initiate) on Jun 19, 2014 at 21:49 UTC