benizi has asked for the wisdom of the Perl Monks concerning the following question:

I'm sure I'm missing some interaction between the various options, but I was wondering if someone (mirod?) could tell me how to accomplish the following. I have an XML document that I want to process with XML::Twig. I want the output document to retain the formatting characteristics of the input document. I also want to use twig_roots, since the file will not fit into memory and is record-based (i.e. the processing of each record is self-contained). I used twig_print_outside_roots because I want to specify a filehandle for the default prints/flushes. (Is there something more appropriate for that purpose?) The problem is that the root (wrapper) element's start tag is being output twice.

Example input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo SYSTEM "/path">
<foo version="blah">
<record>stuff</record>
<record>stuff 2</record>
</foo>

Desired output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo SYSTEM "/path">
<foo version="blah">
<record>altered stuff</record>
<record>altered stuff 2</record>
</foo>

My attempt:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

open my $outfh, '>', "out.xml" or die ">out.xml:$!";
my $p = XML::Twig->new(
    twig_print_outside_roots => $outfh,
    twig_roots => {
        record => sub {
            $_->set_text("altered ".$_->text);
            shift->flush
        }
    },
    empty_tags => 'html',
    keep_encoding => 1,
    keep_spaces => 1,
);
$p->parsefile("in.xml");
print "DONE\n";

Actual output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo SYSTEM "/path">
<foo version="blah">
<foo version="blah"><record>altered stuff</record> # the extra <foo version="blah"> is the problem.
<record>altered stuff 2</record>
</foo>

Re: XML::Twig outputting root element start tag twice
by Tanktalus (Canon) on Apr 18, 2006 at 18:34 UTC

    First, you're missing the end-flush. Before you're really "DONE", you need to add $p->flush(). That gets the extra </foo> tag you're missing in your output. Not that you want it, but once you fix the other problem, you'll want it back.

    Second, it's the twig_print_outside_roots flag that's doing it. Remove that. Instead, change your flush calls (including the new one) to pass $outfh as the argument. Now you'll flush to that file.

    That leaves me with:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    open my $outfh, '>', "out.xml" or die ">out.xml:$!";
    my $p = XML::Twig->new(
        #twig_print_outside_roots => $outfh,
        twig_roots => {
            record => sub {
                $_->set_text("altered ".$_->text);
                shift->flush($outfh);
            }
        },
        empty_tags => 'html',
        keep_encoding => 1,
        keep_spaces => 1,
    );
    $p->parsefile("in.xml");
    $p->flush($outfh);
    print "DONE\n";

    As to why, ... I'm not sure.

    Hope that helps,

    Update: Ok, I see you really want the twig_print_outside_roots feature. It doesn't seem to do what you want it to, though. I am curious, though, as to why the formatting matters - this is XML, after all...

      Explicitly adding the $outfh is part of what I was avoiding, as it's not in the scope of the actual handlers in the real-life example. (XML::Twig has so much DWIMmery, I assumed specifying an output filehandle would be something pretty trivial.)

      As to the formatting, it's because, while I'm using XML::Twig, other people in the project aren't (yet!), and the line-based nature of the format is easier for them to handle. (Plus, I simply prefer the aesthetics of it.)

        Why is the output filehandle not in scope? If it is available when you create the twig, you should be able to use it in the handlers (you can use a closure to pass it to the handlers). You could also use select to send all output to the filehandle, though I would consider that not so good for the maintainability of the code.
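
A minimal sketch of the closure approach, for illustration only. It assumes the handlers are built by a factory sub (make_record_handler is a hypothetical name, not part of XML::Twig) that is called where the filehandle is in scope. The in-memory filehandle and the inline document stand in for out.xml and in.xml so the sketch runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): returns a twig_roots
# handler that closes over $fh, so the handler body never needs the
# filehandle to be in its own lexical scope.
sub make_record_handler {
    my ($fh) = @_;
    return sub {
        my ($twig, $record) = @_;
        $record->set_text("altered " . $record->text);
        $twig->flush($fh);    # flush to the filehandle captured above
    };
}

# In-memory filehandle, used here only so the sketch is self-contained.
my $output = '';
open my $outfh, '>', \$output or die "cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    twig_roots  => { record => make_record_handler($outfh) },
    keep_spaces => 1,
);

$twig->parse(<<'XML');
<foo version="blah">
<record>stuff</record>
<record>stuff 2</record>
</foo>
XML

# Final flush, as in the reply above; recent versions of XML::Twig
# are said to do this automatically at the end of the parse.
$twig->flush($outfh);
close $outfh or die "close failed: $!";

print $output;
```

The select alternative would instead wrap the parse call in a select($outfh) and a restoring select of the previous handle, leaving the handlers to call flush with no argument. That keeps the handlers shorter, but it hides where the output goes, which is the maintainability concern raised above.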

        Finally, if you are using the latest version of XML::Twig, you don't need the final flush, it's done automagically.