Xenofur has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perlmonks,

I'm trying to convert an XML file I have into a more readable CSV file. It contains things like this:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE xliff PUBLIC "-//XLIFF//DTD XLIFF//EN" "http://www.oasis-open.org/committees/xliff/documents/xliff.dtd" >
<xliff version="1.0">
  <file datatype="plaintext" original="game/stringtable/Example.stt" source-language="en" target-language="de" xml:space="default">
    <header/>
    <body>
      <trans-unit approved="yes" id="Example_monsternpc_greeting" reformat="yes" translate="yes" xml:space="default">
        <source>'I am a monster NPC.'</source><target state="translated" xml:lang="de">'Ich bin ein Monster-NSC.'</target>
        <prop-group><prop prop-type="entryid">1173668791078</prop></prop-group><alt-trans match-quality="100" origin="5k" xml:space="default"><source xml:lang="en">'I am a monster NPC.'</source><target xml:lang="de">'Ich bin ein Monster-NSC.'</target><prop-group><prop prop-type="entryid">1173668807656</prop></prop-group></alt-trans></trans-unit>
      <trans-unit approved="yes" id="Example_monsternpc_pc_transform" reformat="yes" translate="yes" xml:space="default">
        <source>'Turn into a monster.'</source><target xml:lang="de">'In ein Monster verwandeln.'</target>
        <prop-group><prop prop-type="entryid">1173668791093</prop></prop-group><alt-trans match-quality="100" origin="5k" xml:space="default"><source xml:lang="en">'Turn into a monster.'</source><target xml:lang="de">'In ein Monster verwandeln.'</target><prop-group><prop prop-type="entryid">1173668807671</prop></prop-group></alt-trans></trans-unit>
      <trans-unit approved="yes" id="Example_private_testMe2" reformat="yes" translate="yes" xml:space="default">
        <source>and again</source><target xml:lang="de">und noch einer</target>
        <prop-group><prop prop-type="entryid">1173668791109</prop></prop-group><alt-trans match-quality="100" origin="5k" xml:space="default"><source xml:lang="en">and again</source><target xml:lang="de">und noch einer</target><prop-group><prop prop-type="entryid">1173668807687</prop></prop-group></alt-trans></trans-unit>
    </body>
  </file>
  <file datatype="plaintext" original="game/stringtable/training_skeleton.stt" source-language="en" target-language="de" xml:space="default">
    <header/>
    <body>
      <trans-unit approved="yes" id="training_skeleton_ID_KillMe" reformat="yes" translate="yes" xml:space="default">
        <source>You should get to know Sigmund before leaving to explore Stormreach.</source><target xml:lang="de">Ihr solltet Sigmund kennenlernen, bevor Ihr Euch aufmacht, um Stormreach zu erkunden.</target>
        <prop-group><prop prop-type="entryid">1173668791328</prop></prop-group><alt-trans match-quality="100" origin="5k" xml:space="default"><source xml:lang="en">You should get to know Sigmund before leaving to explore Stormreach.</source><target xml:lang="de">Ihr solltet Sigmund kennenlernen, bevor Ihr Euch aufmacht, um Stormreach zu erkunden.</target><prop-group><prop prop-type="entryid">1173668807906</prop></prop-group></alt-trans></trans-unit>
    </body>
  </file>
</xliff>
My script to convert it looks as follows:
#!/bin/perl
use strict;
use warnings;
use XML::Simple;

my $config = XMLin("meep.xlf", ForceArray => 1);

open(EXTR, ">meep.csv") or die $!;

foreach my $i ( 0 .. $#{ $config->{file} } ) {
    my $file = $config->{file}->[$i]->{original};
    $file =~ s/game\/stringtable\///i;
    for my $entry ( keys %{ $config->{file}->[$i]->{body}->[0]->{'trans-unit'} } ) {
        my $source = $config->{file}->[$i]->{body}->[0]->{'trans-unit'}->{$entry}->{source}->[0];
        my $target = $config->{file}->[$i]->{body}->[0]->{'trans-unit'}->{$entry}->{target}->[0]->{content};
        print EXTR '"'.$file.'","'.$entry.'","'.$source.'","'.$target.'"'."\n";
    }
}
close EXTR;
And the output is this:
"Example.stt","Example_monsternpc_greeting","'I am a monster NPC.'","'Ich bin ein Monster-NSC.'"
"Example.stt","Example_private_testMe2","and again","und noch einer"
"Example.stt","Example_monsternpc_pc_transform","'Turn into a monster.'","'In ein Monster verwandeln.'"
"training_skeleton.stt","training_skeleton_ID_KillMe","You should get to know Sigmund before leaving to explore Stormreach.","Ihr solltet Sigmund kennenlernen, bevor Ihr Euch aufmacht, um Stormreach zu erkunden."
The problem I have with this is: for some reason it has decided to change the order in which the elements appear in the original file. (And it apparently does so randomly.) Is there a way to tell XMLin something to the effect of: Just leave that stuff in the order you found it!

Or have I done something fundamentally wrong? :/

Greetings, Xenofur

Replies are listed 'Best First'.
Re: sort order of imported xml data?
by roboticus (Chancellor) on Mar 30, 2007 at 11:34 UTC
    Xenofur:

    XML::Simple returns a hash, which has no particular order. There are multiple ways around this, if order is important to you.

  • You could use the ForceArray=>[names] option to put the named elements in an array, which should preserve their order, or
  • You might try XML::Twig so you can capture the items in order with your callback. This will have the advantage that you can process files much larger than you could hold in RAM.
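A minimal sketch of the XML::Twig route, assuming XML::Twig is installed. The inline XML is a trimmed stand-in for the question's meep.xlf, and the @rows array is only there for illustration. The handler fires once per <trans-unit>, in document order, so nothing gets shuffled through a hash:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Trimmed sample modelled on the XLIFF in the question (assumption: the
# real file has the same file > body > trans-unit nesting).
my $xml = <<'XML';
<xliff version="1.0">
  <file original="game/stringtable/Example.stt">
    <body>
      <trans-unit id="Example_monsternpc_greeting">
        <source>'I am a monster NPC.'</source>
        <target xml:lang="de">'Ich bin ein Monster-NSC.'</target>
      </trans-unit>
      <trans-unit id="Example_monsternpc_pc_transform">
        <source>'Turn into a monster.'</source>
        <target xml:lang="de">'In ein Monster verwandeln.'</target>
      </trans-unit>
      <trans-unit id="Example_private_testMe2">
        <source>and again</source>
        <target xml:lang="de">und noch einer</target>
      </trans-unit>
    </body>
  </file>
</xliff>
XML

my @rows;    # collected CSV fields, one array ref per trans-unit

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <trans-unit> has been parsed.
        'trans-unit' => sub {
            my ($t, $unit) = @_;
            # Walk up to the enclosing <file> for the file name.
            (my $file = $unit->parent('file')->att('original'))
                =~ s{game/stringtable/}{}i;
            push @rows, [
                $file,
                $unit->att('id'),
                $unit->first_child_text('source'),
                $unit->first_child_text('target'),
            ];
            $t->purge;    # drop what has been fully parsed so far
        },
    },
);

$twig->parse($xml);    # for the real file: $twig->parsefile("meep.xlf")

print join(',', map { qq{"$_"} } @$_), "\n" for @rows;
```

The purge call is what keeps memory use flat on large files; for a small file it can be left out.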

    ...roboticus

    Update: Added code tags around the ForceArray bit to fix the square brackets...

      It is not enough to use ForceArray in this case, because the XML that the OP presents uses key attributes. The regular elements will be forced into an array structure but the key attributes would still live in an unsorted hash.

      The solution is to combine ForceArray with the option KeyAttr to also force key attributes into an array.

      my $config = XMLin("meep.xlf", ForceArray => 1, KeyAttr =>[]);

      NB: This changes the structure that XML::Simple returns so you will need to update your code a little.

      I tend to use Data::Dumper to inspect the hash that XML::Simple returns to check if that meets my needs.
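As a rough sketch of what "update your code a little" might look like with KeyAttr => []: each 'trans-unit' becomes an array of hashes in document order, and id turns into an ordinary key inside each hash. The inline XML is a trimmed stand-in for meep.xlf and @rows is just an illustrative name; checking the real structure with Data::Dumper first is still a good idea.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Trimmed sample modelled on the XLIFF in the question.
my $xml = <<'XML';
<xliff version="1.0">
  <file original="game/stringtable/Example.stt">
    <body>
      <trans-unit id="Example_monsternpc_greeting">
        <source>'I am a monster NPC.'</source>
        <target xml:lang="de">'Ich bin ein Monster-NSC.'</target>
      </trans-unit>
      <trans-unit id="Example_monsternpc_pc_transform">
        <source>'Turn into a monster.'</source>
        <target xml:lang="de">'In ein Monster verwandeln.'</target>
      </trans-unit>
      <trans-unit id="Example_private_testMe2">
        <source>and again</source>
        <target xml:lang="de">und noch einer</target>
      </trans-unit>
    </body>
  </file>
</xliff>
XML

# KeyAttr => [] stops XML::Simple from folding the units into a hash
# keyed on id, so their order survives. With a real file, pass the
# file name ("meep.xlf") instead of the string.
my $config = XMLin($xml, ForceArray => 1, KeyAttr => []);

my @rows;
for my $file ( @{ $config->{file} } ) {
    (my $name = $file->{original}) =~ s{game/stringtable/}{}i;
    for my $unit ( @{ $file->{body}[0]{'trans-unit'} } ) {
        push @rows, [
            $name,
            $unit->{id},                    # now a plain key, not the hash key
            $unit->{source}[0],
            $unit->{target}[0]{content},
        ];
    }
}

print join(',', map { qq{"$_"} } @$_), "\n" for @rows;
```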

        ++varian:

        ...and that's what I get for guessing instead of trying it out.

        Thanks for the catch! (Yarch, two slipped gears on the same day. I'll have to be sure not to skip my morning coffee ... um, err, uh ....oil! Yeah, that's the ticket.....)

        ...roboticus

        Sorry for the late replies, work got over-busy and this project sat on the back burner, but I'm back on it and a good step ahead thanks to your input.

        And I have to say, as for quickly solving my problem with the least effort, varian is spot on! :)
        I should've known to read the documentation more closely. ^^

        Also, yes, I've basically been using Data::Dumper all the time, but cut it out of my example code as not relevant. I do, however, use it to basically reverse-construct my access code. =)
Re: sort order of imported xml data?
by f00li5h (Chaplain) on Mar 30, 2007 at 11:23 UTC

    I suspect it's because of the hash that you're getting back, what with hashes not having an order and all...

    Perhaps you can get a list of attribute names (ordered) and then get the values from them when processing the xml?
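A small core-Perl sketch of that idea, under the assumption that the ids can be collected in file order by some other means (the @ids_in_file_order list below is hypothetical and hard-coded). The hash alone cannot give the order back, but an array kept next to it can:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: ids in the order they appear in the XML file.
my @ids_in_file_order = qw(
    Example_monsternpc_greeting
    Example_monsternpc_pc_transform
    Example_private_testMe2
);

# A hash keyed by id, similar in spirit to what XML::Simple builds
# by default. The order written here is irrelevant to Perl.
my %source_for = (
    Example_private_testMe2         => 'and again',
    Example_monsternpc_greeting     => q{'I am a monster NPC.'},
    Example_monsternpc_pc_transform => q{'Turn into a monster.'},
);

# "keys %source_for" may come back in any order; walking the array
# instead keeps the rows in document order.
my @ordered = map { [ $_, $source_for{$_} ] } @ids_in_file_order;

print qq{"$_->[0]","$_->[1]"\n} for @ordered;
```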

    Update: roboticus speaks great wisdom on this in Re: sort order of imported xml data?. It's clear he used those extra 11 minutes to figure out what he's talking about.

    @_=qw; ask f00li5h to appear and remain for a moment of pretend better than a lifetime;;s;;@_[map hex,split'',B204316D8C2A4516DE];;y/05/os/&print;
Re: sort order of imported xml data?
by Jenda (Abbot) on Mar 30, 2007 at 20:21 UTC

    What about using a different module?

    #!/bin/perl
    use strict;
    use warnings;
    use XML::Rules;

    my $parser = XML::Rules->new(
        start_rules => [
            file => sub {
                my ($tag_name, $attrs, $context, $parent_data, $parser) = @_;
                # remember the trimmed file name in the parser's pad
                $parser->{pad}{file} = $attrs->{original};
                $parser->{pad}{file} =~ s{game/stringtable/}{}i;
                return 1;    # a false value would make the parser skip this <file>
            },
        ],
        rules => [
            source => 'content',
            target => 'content',
            _default => '',
            'trans-unit' => sub {
                my ($tag_name, $attrs, $context, $parent_data, $parser) = @_;
                print EXTR qq{"$parser->{pad}{file}","$attrs->{id}","$attrs->{source}","$attrs->{target}"\n};
                return;
            },
        ],
    );

    open(EXTR, ">meep.csv") or die $!;
    $parser->parsefile( "meep.xlf");
    Using this you can handle XML files of this structure that are hundreds of megabytes large, because 1) it stores only what it needs from the subtags of <trans-unit> and 2) as soon as it's done with a <trans-unit> tag it forgets all its data.

    What the script does is this: when the parser reads the opening <file> tag, the script tweaks the original attribute and remembers it in a "pad", an attribute of the parser object designated to hold script-specific data. Whenever it has parsed a complete <source> or <target> tag, it remembers just the content and makes it readily available in the attribute hash of the parent tag. Whenever it has parsed a complete <trans-unit> tag, it prints the remembered file name, the id attribute and the contents of the <source> and <target> subtags, and then forgets the data of that tag.

    Here is a version without using the pad and using a lexical filehandle:

    #!/bin/perl
    use strict;
    use warnings;
    use XML::Rules;

    my $parser = XML::Rules->new(
        start_rules => [
            file => sub {
                my ($tag_name, $attrs) = @_;
                $attrs->{original} =~ s{game/stringtable/}{}i;
                return 1;
            },
        ],
        rules => [
            'trans-unit' => sub {
                my ($tag_name, $attrs, $context, $parent_data, $parser) = @_;
                my $file = $parent_data->[-2]{original};
                print {$parser->{parameters}{FH}} qq{"$file","$attrs->{id}","$attrs->{source}","$attrs->{target}"\n};
                # or
                # my $FH = $parser->{parameters}{FH};
                # print $FH qq{"$file","$attrs->{id}","$attrs->{source}","$attrs->{target}"\n};
                return;
            },
            source => 'content',
            target => 'content',
            _default => '',
        ],
    );

    open(my $EXTR, ">meep.csv") or die $!;
    $parser->parsefile( "meep.xlf", {FH => $EXTR});
    close $EXTR;