PerlMonks  

SOLVED-X-Post from StackOverflow->Strange problem when deleting certain elements in a XML file using XML::Twig.

by Dasen (Initiate)
on Aug 08, 2013 at 10:26 UTC (id://1048534, perlquestion)

Dasen has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone, I'm having the strangest problem using XML::Twig.

Here's the thing: I've got a huge XML file. You can see a snippet below:

    <?xml version="1.0" encoding="utf-16"?>
    <!DOCTYPE tmx SYSTEM "tmx14.dtd">
    <tmx version="1.4">
      <header creationtool="MemoQ" creationtoolversion="6.2.21" segtype="sentence" adminlang="en-us" creationid="" srclang="pt-pt" o-tmf="MemoQTM" datatype="unknown">
        <prop type="defclient"> </prop>
        <prop type="defproject"> </prop>
        <prop type="defdomain"> </prop>
        <prop type="defsubject"> </prop>
        <prop type="description"> </prop>
        <prop type="targetlang">it</prop>
        <prop type="name">pt_PT-it_IT</prop>
      </header>
      <body>
        <tu changedate="20120625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-pre">&lt;seg&gt;O recinto do Pavilhão Atlântico, versátil por natureza, está vocacionado para receber os mais variados eventos.&lt;/seg&gt;</prop>
            <prop type="x-context-post">&lt;seg&gt;A Sala Atlântico, com uma arena de 5 200 m2, abriga, com uma versatilidade única, todo o tipo de eventos.&lt;/seg&gt;</prop>
            <seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>È composto da tre aree fra di esse integrate, le quali sono tutte facilmente adattabili alle caratteristiche specifiche di ogni evento.</seg>
          </tuv>
        </tu>
        <tu changedate="20130625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg&gt;Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg&gt;</prop>
            <seg>O recinto do Pavilhão Atlântico, versátil por natureza, está vocacionado para receber os mais variados eventos.</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
        <tu changedate="20140625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg&gt;Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg&gt;</prop>
            <seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
        <tu changedate="20140625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg&gt;Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg&gt;</prop>
            <seg>Teste</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
        <tu changedate="20110625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg&gt;Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg&gt;</prop>
            <seg>Teste</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
      </body>
    </tmx>

Basically, my code scans the XML file using XML::Twig. Every time it finds a "tu" element, it stores the text of its "seg" element in a hash together with the "changedate" attribute, but only if that "seg" text isn't in the hash yet or the changedate is newer than the one already stored.

Then I do a second pass, where I look up the "seg" text of every "tu" element in the hash and delete the "tu" element if its changedate is older than the one stored.
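
The two passes described above can be sketched without any XML at all. This is only an illustration under stated assumptions: the @units list and the shortened segment labels are hypothetical stand-ins for the "tu" elements in the sample file, and newest_dates and keep_newest are names invented for this sketch, not part of the posted script.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the five <tu> elements in the sample:
# each record holds the changedate attribute and a shortened <seg> text.
my @units = (
    { changedate => '20120625T175037Z', seg => 'Composto'  },
    { changedate => '20130625T175037Z', seg => 'O recinto' },
    { changedate => '20140625T175037Z', seg => 'Composto'  },
    { changedate => '20140625T175037Z', seg => 'Teste'     },
    { changedate => '20110625T175037Z', seg => 'Teste'     },
);

# Pass 1: remember the newest changedate seen for each segment text.
# The timestamps are fixed-width, so string comparison (lt) orders them.
sub newest_dates {
    my @records = @_;
    my %newest;
    for my $tu (@records) {
        my ($seg, $date) = @{$tu}{qw(seg changedate)};
        $newest{$seg} = $date
            if !exists $newest{$seg} || $newest{$seg} lt $date;
    }
    return \%newest;
}

# Pass 2: keep only the units whose changedate equals the newest one recorded.
sub keep_newest {
    my @records = @_;
    my $newest  = newest_dates(@records);
    return grep { $_->{changedate} eq $newest->{ $_->{seg} } } @records;
}

my @kept = keep_newest(@units);
print "$_->{changedate} $_->{seg}\n" for @kept;
```

With this data the second, third and fourth records are kept, which matches the three "tu" elements that survive in the processed file.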

But enough talking, here's the code:

    use 5.010;
    use strict;
    use warnings;
    use XML::Twig;
    #use Digest::MD5 qw(md5);

    my $filename = 'pt_PT-it_IT-novo2.tmx';
    my $out_filename = 'out.tmx';
    open my $out, '>', $out_filename;
    binmode $out, ':encoding(UTF-8)';

    print "PASSAGEM 1...\n";    # "PASS 1..."
    my $first_pass = new XML::Twig (twig_handlers => {tu => \&first_pass});
    $first_pass->parsefile($filename);
    $first_pass->purge();
    print "FEITO\n";            # "DONE"

    print "\nPASSAGEM 2....\n"; # "PASS 2...."
    my $second_pass = new XML::Twig (#twig_roots => { 'tu' => 1 },
                                     #twig_print_outside_roots => 1,
                                     pretty_print => 'indented',
                                     twig_handlers => {tu => \&second_pass});
    $second_pass->parsefile($filename);
    close $out;
    print "\nFEITO\n";          # "DONE"

    {
        my %hash;   # seg text => newest changedate seen

        sub first_pass {
            my($twig, $tu) = @_;
            my $seg = $tu->first_child('tuv')->first_child('seg')->text;
            my $changedate = $tu->att('changedate');
            #$changedate = substr $changedate, 0, 8;
            #my $hash = md5($seg);
            if ( (!(exists($hash{$seg})) ) || (($hash{$seg}) lt $changedate) ) {
                $hash{$seg} = $changedate;
            }
            $twig->purge();
        }

        sub second_pass {
            my($twig, $tu) = @_;
            #print $original_tu->text;
            my $seg = $tu->first_child('tuv')->first_child('seg')->text;
            my $changedate = $tu->att('changedate');
            #$changedate = substr $changedate, 0, 8;
            #my $hash = md5($seg);
            if (!(($hash{$seg}) eq $changedate)) {
                print "================================\n";
                print "APAGADO\n";                          # "DELETED"
                print $seg;
                print "\n POIS DATA DE ORIGINAL: ";         # "because the original's date: "
                print $changedate;
                print " E MAIS ANTIGA QUE ENCONTRADA: ";    # " is older than the one found: "
                print $hash{$seg};
                print "\n=================================\n";
                $tu->delete;
            }
            $twig->flush($out);
        }
    }

The problem is that I'm getting all kinds of stray "body" and "header" tags in the middle of the document. Here's the XML document above after it's processed:

    <?xml version="1.0" encoding="utf-16"?>
    <!DOCTYPE tmx SYSTEM "tmx14.dtd">
    <tmx version="1.4">
      <header adminlang="en-us" creationid="" creationtool="MemoQ" creationtoolversion="6.2.21" datatype="unknown" o-tmf="MemoQTM" segtype="sentence" srclang="pt-pt">
        <prop type="defclient"> </prop>
        <prop type="defproject"> </prop>
        <prop type="defdomain"> </prop>
        <prop type="defsubject"> </prop>
        <prop type="description"> </prop>
        <prop type="targetlang">it</prop>
        <prop type="name">pt_PT-it_IT</prop>
      </header>
      <body></body>
    </tmx>
    <?xml version="1.0" encoding="utf-16"?>
    <!DOCTYPE tmx SYSTEM "tmx14.dtd">
    <tmx version="1.4">
      <header adminlang="en-us" creationid="" creationtool="MemoQ" creationtoolversion="6.2.21" datatype="unknown" o-tmf="MemoQTM" segtype="sentence" srclang="pt-pt">
        <prop type="defclient"> </prop>
        <prop type="defproject"> </prop>
        <prop type="defdomain"> </prop>
        <prop type="defsubject"> </prop>
        <prop type="description"> </prop>
        <prop type="targetlang">it</prop>
        <prop type="name">pt_PT-it_IT</prop>
      </header>
      <body>
        <tu changedate="20130625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg></prop>
            <seg>O recinto do Pavilhão Atlântico, versátil por natureza, está vocacionado para receber os mais variados eventos.</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
        <tu changedate="20140625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg></prop>
            <seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
        <tu changedate="20140625T175037Z" changeid="ana">
          <prop type="client"> </prop>
          <prop type="project"> </prop>
          <prop type="domain"> </prop>
          <prop type="subject"> </prop>
          <prop type="corrected">no</prop>
          <prop type="aligned">yes</prop>
          <tuv xml:lang="pt-pt">
            <prop type="x-context-post">&lt;seg>Composto por três áreas integradas, todos os espaços são facilmente adaptados às características de cada evento.&lt;/seg></prop>
            <seg>Teste</seg>
          </tuv>
          <tuv xml:lang="it">
            <seg>Lo spazio di pertinenza del Pavilhão Atlântico, versatile per natura, è adatto a ricevere gli eventi più svariati.</seg>
          </tuv>
        </tu>
      </body>
    </tmx>
      </body>
    </tmx>

Dear monks, what's wrong with my code?

Replies are listed 'Best First'.
Re: Strange problem when deleting certain elements in a XML file using XML::Twig.
by choroba (Cardinal) on Aug 08, 2013 at 10:29 UTC
    Crossposted at StackOverflow. It is considered polite to inform about crossposting so hackers not attending both sites do not waste their efforts solving a problem already sorted out at the other end of the Internets.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Sorry, I wasn't aware of that, I'll edit the post. Dasen
        I've solved it! I changed "$twig->flush($out)" to "$tu->flush($out)" and moved it into the else branch of the if statement, so it only runs for elements that are not deleted. Thanks anyway, guys! Dasen
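
The fix reported above can be sketched roughly as follows. This is a minimal sketch rather than the poster's actual script: the tiny XML string, the in-memory output handle and the pre-filled %newest hash are hypothetical stand-ins for the real TMX file, the output file and the hash built during the first pass. It assumes XML::Twig is installed.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature input with two units that share the same segment.
my $xml = <<'XML';
<tmx><body>
<tu changedate="20120625T175037Z"><tuv><seg>A</seg></tuv></tu>
<tu changedate="20140625T175037Z"><tuv><seg>A</seg></tuv></tu>
</body></tmx>
XML

# Stand-in for the hash filled by the first pass: seg text => newest changedate.
my %newest = ( A => '20140625T175037Z' );

# Collect the output in a string instead of a file (assumption for this sketch).
my $output = '';
open my $out, '>', \$output or die "Cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        tu => sub {
            my ($t, $tu) = @_;
            my $seg  = $tu->first_child('tuv')->first_child('seg')->text;
            my $date = $tu->att('changedate');
            if ($newest{$seg} ne $date) {
                $tu->delete;        # stale unit: remove it and do not flush
            }
            else {
                $tu->flush($out);   # current unit: write out what has been parsed
            }
        },
    },
);
$twig->parse($xml);
close $out;

print $output;
```

The point of the change is that delete and flush are no longer both called for the same element: an element is either dropped or written, never both.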
