ateague has asked for the wisdom of the Perl Monks concerning the following question:

Good morning!

I am using XML::Twig to conditionally filter out elements in an XML file and then conditionally "duplex" the output to two different output files. I have managed to jury-rig something that gives me the correct output, but I imagine there is a better, more correct way to accomplish the task that does not involve reprocessing the input file multiple times.

In my sample program below, I am splitting <thing> elements with type attributes of "vegetable" and "fruit" off into separate files. <thing> elements with a type attribute of "city" are filtered out and deleted. The <header> and <trailer> elements are duplexed to both output files. Is there a way to conditionally split target elements off into separate files and duplicate elements "outside" the target element to separate files without having to read the input file multiple times?

#!/usr/bin/perl
use 5.018;
use strict;
use warnings;
use XML::Twig;

{
    my $t;
    my $pos = tell 'DATA'; # save the offset...

    # Process fruit
    open (my $FRUIT, '>', './fruit.xml') or die "./fruit.xml:\n$!\n$^E";
    $t = XML::Twig->new(
        twig_handlers => {
            'thing'     => sub { _filter(@_, 'fruit', $FRUIT); 1; },
            'thing//*'  => sub { 1; },
            '_default_' => sub { $_[0]->flush($FRUIT); 1; },
            '#CDATA'    => sub { 1; },
        },
        pretty_print => 'indented',
        comments     => 'drop',   # remove any comments
        empty_tags   => 'normal', # empty tags = <tag/>
    );
    $t->parse(*DATA);
    close $FRUIT;

    seek 'DATA', $pos, 0; # reset DATA for the second run-through

    # Process vegetables
    open (my $VEG, '>', './veg.xml') or die "./veg.xml:\n$!\n$^E";
    $t = XML::Twig->new(
        twig_handlers => {
            'thing'     => sub { _filter(@_, 'vegetable', $VEG); 1; },
            'thing//*'  => sub { 1; },
            '_default_' => sub { $_[0]->flush($VEG); 1; },
            '#CDATA'    => sub { 1; },
        },
        pretty_print => 'indented',
        comments     => 'drop',   # remove any comments
        empty_tags   => 'normal', # empty tags = <tag/>
    );
    $t->parse(*DATA);
    close $VEG;
}

sub _filter {
    my ($_twig, $thing_element, $keep_me, $PRINT_FILE) = @_;

    # Flush the twig to file if the 'type' attribute matches...
    if ( $thing_element->{att}{type} eq $keep_me ) {
        $_twig->flush($PRINT_FILE);
    }
    # ... otherwise delete the twig
    else {
        $thing_element->delete();
    }
    return 1;
}

__DATA__
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing type="fruit" >Im an apple!</thing>
  <thing type="city" >Toronto</thing>
  <thing type="vegetable" >Im a carrot!</thing>
  <thing type="city" >Melrose</thing>
  <thing type="vegetable" >Im a potato!</thing>
  <thing type="fruit" >Im a pear!</thing>
  <thing type="vegetable" >Im a pickle!</thing>
  <thing type="city" >Patna</thing>
  <thing type="fruit" >Im a banana!</thing>
  <thing type="vegetable" >Im an eggplant!</thing>
  <thing type="city" >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

Thank you for your time.

perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)

perl -MXML::Twig -E "say $XML::Twig::VERSION;"
3.48

Re: XML::Twig - Filtering and duplexing output to multiple output files
by Loops (Curate) on Nov 12, 2014 at 00:19 UTC

    As long as you don't mind using IO::Tee from CPAN to send output to two files at once, the code below works. In the default case you output to both files, and in the "thing" case you choose which file (or neither) receives the output.

    use XML::Twig;
    use IO::Tee;

    open my $frufile, '>', 'fruit.xml' or die "fruit $!";
    open my $vegfile, '>', 'veg.xml' or die "veg $!";
    my $tee = IO::Tee->new($frufile, $vegfile);
    select $tee;

    my $twig=XML::Twig->new(
        twig_handlers => {
            thing     => \&magic,
            _default_ => sub { $_[0]->flush; 1; },
        },
        pretty_print => 'indented',
        empty_tags   => 'normal',
    );
    $twig->parse( *DATA );

    sub magic {
        my ($thing, $element) = @_;
        for ($element->{att}{type}) {
            if (/fruit/) {
                $thing->flush($frufile);
            }
            elsif (/vegetable/) {
                $thing->flush($vegfile);
            }
            else {
                $thing->purge;
            }
        }
        1;
    }

    __DATA__
    <batch>
      <header>
        <foo>1</foo>
        <bar>2</bar>
        <baz>3</baz>
      </header>
      <thing type="fruit" >Im an apple!</thing>
      <thing type="city" >Toronto</thing>
      <thing type="vegetable" >Im a carrot!</thing>
      <thing type="city" >Melrose</thing>
      <thing type="vegetable" >Im a potato!</thing>
      <thing type="fruit" >Im a pear!</thing>
      <thing type="vegetable" >Im a pickle!</thing>
      <thing type="city" >Patna</thing>
      <thing type="fruit" >Im a banana!</thing>
      <thing type="vegetable" >Im an eggplant!</thing>
      <thing type="city" >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
      <trailer>
        <chrzaszcz>A</chrzaszcz>
        <zdzblo>B</zdzblo>
      </trailer>
    </batch>
      Thank you very much Loops! IO::Tee was exactly what I was looking for.
Re: XML::Twig - Filtering and duplexing output to multiple output files
by Discipulus (Canon) on Nov 12, 2014 at 08:06 UTC
    Hello ateague,
    You already have the right answer from Loops; I only want to add a few words about your last question: "Is there a way to conditionally split target elements off into separate files and duplicate elements "outside" the target element to separate files without having to read the input file multiple times?" Even without a module to multiplex the output, you do not need to re-read the file. You can just build your own print sub:
    sub print_dup {
        foreach my $fh ( @previously_opened_handle_for_write ) {
            print {$fh} @_;
        }
    }
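
A minimal, self-contained sketch of that print sub in use, assuming in-memory filehandles in place of real output files so it runs without touching disk. The names `@out_handles`, `$fruit_out` and `$veg_out` are hypothetical, and the sketch only shows duplicating strings you already hold; it does not involve XML::Twig at all.

```perl
use strict;
use warnings;

# Hypothetical output targets: in-memory handles stand in for fruit.xml / veg.xml.
my ($fruit_out, $veg_out) = ('', '');
open my $fruit_fh, '>', \$fruit_out or die "fruit: $!";
open my $veg_fh,   '>', \$veg_out   or die "veg: $!";
my @out_handles = ($fruit_fh, $veg_fh);

# Print the same arguments to every handle in @out_handles.
sub print_dup {
    for my $fh (@out_handles) {
        print {$fh} @_;
    }
    return;
}

print_dup("<header>shared</header>\n");                  # duplicated to both
print {$fruit_fh} "<thing type=\"fruit\">apple</thing>\n"; # fruit only

close $_ for @out_handles;
```

After the handles are closed, `$fruit_out` holds both lines and `$veg_out` holds only the shared header line.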

    You can also review similar questions here at PerlMonks, such as this one, or my naive approach.

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      The problem in this case is the flush method, which does some light magic to output the parts of the tree that haven't been output yet, and then updates the state of the XML::Twig object (so the next call to flush Just Works™). You can't call it twice in a row with two different filehandles: after the first call those parts count as already output, so it wouldn't work. That's why using IO::Tee is a brilliant idea.
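
To illustrate why a single fan-out handle sidesteps the "call it twice" problem, here is a rough sketch of a tied filehandle that copies every print to several underlying handles, so the writer only ever prints once. `My::MiniTee` and `BOTH` are hypothetical names invented for this sketch; it is not IO::Tee's actual implementation, and it is exercised here with plain `print` only, not with XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical minimal tee: a tied handle that forwards each print
# to every handle it was given.
package My::MiniTee;

sub TIEHANDLE {
    my ($class, @handles) = @_;
    return bless [@handles], $class;
}

sub PRINT {
    my ($self, @args) = @_;
    my $ok = 1;
    for my $fh (@$self) {
        print {$fh} @args or $ok = 0;   # remember any failed write
    }
    return $ok;
}

sub PRINTF {
    my ($self, $format, @args) = @_;
    return $self->PRINT(sprintf $format, @args);
}

package main;

# In-memory handles stand in for the two output files.
my ($fruit_out, $veg_out) = ('', '');
open my $fruit_fh, '>', \$fruit_out or die "fruit: $!";
open my $veg_fh,   '>', \$veg_out   or die "veg: $!";

# One handle for shared content; the real handles remain usable on their own.
tie *BOTH, 'My::MiniTee', $fruit_fh, $veg_fh;

print BOTH "<header/>\n";                     # one print, two destinations
print {$fruit_fh} "<thing>apple</thing>\n";   # fruit only
print {$veg_fh}   "<thing>carrot</thing>\n";  # vegetable only
print BOTH "<trailer/>\n";

untie *BOTH;
close $_ for $fruit_fh, $veg_fh;
```

The design point is that the producer emits shared content exactly once and the handle does the duplication, which matches the constraint described above for flush.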

        Obviously the author is right!
        See the docs about flush.

        You cannot hack it using print or sprint, because they have to be used AFTER parse. Lesson learned. Thanks.

        L*