[SOLVED] XML::Twig - Filtering and duplexing output to multiple output files

ateague has asked for the wisdom of the Perl Monks concerning the following question:

Good morning!

I am using XML::Twig to conditionally filter out elements in an XML file and then conditionally "duplex" the output to two different output files. I have managed to jury-rig something that gives me the correct output, but I imagine there is a better, more correct way to accomplish the task that does not involve reprocessing the input file multiple times.

In my sample program below, I am splitting <thing> elements with type attributes of "vegetable" and "fruit" off into separate files. <thing> elements with a type attribute of "city" are filtered out and deleted. The <header> and <trailer> elements are duplexed to both output files. Is there a way to conditionally split target elements off into separate files and duplicate elements "outside" the target element to separate files without having to read the input file multiple times?

#!/usr/bin/perl
use 5.018;
use strict;
use warnings;
use XML::Twig;

{
  my $t;
  my $pos = tell 'DATA'; # save the offset...
  
  # Process fruit
  open (my $FRUIT, '>', './fruit.xml') or die "./fruit.xml:\n$!\n$^E";
  $t = XML::Twig->new(
    twig_handlers => {
      'thing'      => sub { _filter(@_, 'fruit', $FRUIT); 1; },
      'thing//*'   => sub { 1; },
      '_default_'  => sub { $_[0]->flush($FRUIT); 1; },
      '#CDATA'     => sub { 1; },
    },
    pretty_print => 'indented',
    comments     => 'drop',  # remove any comments
    empty_tags   => 'normal',# empty tags = <tag/>
  );
  
  $t->parse(*DATA);
  close $FRUIT;
  
  seek 'DATA', $pos, 0; # reset DATA for the second run-through
  
  # Process vegetables
  open (my $VEG, '>', './veg.xml') or die "./veg.xml:\n$!\n$^E";
  $t = XML::Twig->new(
    twig_handlers => {
      'thing'     => sub { _filter(@_, 'vegetable', $VEG); 1; },
      'thing//*'  => sub { 1; },
      '_default_' => sub { $_[0]->flush($VEG); 1; },
      '#CDATA'    => sub { 1; },
    },
    pretty_print => 'indented',
    comments     => 'drop',  # remove any comments
    empty_tags   => 'normal',# empty tags = <tag/>
  );
  
  $t->parse(*DATA);
  close $VEG;
}

sub _filter {
  my ($_twig, $thing_element, $keep_me, $PRINT_FILE) = @_;
  
  # Flush the twig to file if the 'type' attribute matches...
  if ( $thing_element->{att}{type} eq $keep_me ) {
    $_twig->flush($PRINT_FILE);
  }
  
  # ... otherwise delete the twig
  else {
    $thing_element->delete();
  }
  
  return 1;
}

__DATA__
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing type="fruit"     >Im an apple!</thing>
  <thing type="city"      >Toronto</thing>
  <thing type="vegetable" >Im a carrot!</thing>
  <thing type="city"      >Melrose</thing>
  <thing type="vegetable" >Im a potato!</thing>
  <thing type="fruit"     >Im a pear!</thing>
  <thing type="vegetable" >Im a pickle!</thing>
  <thing type="city"      >Patna</thing>
  <thing type="fruit"     >Im a banana!</thing>
  <thing type="vegetable" >Im an eggplant!</thing>
  <thing type="city"      >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

Thank you for your time.

perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)

perl -MXML::Twig -E "say $XML::Twig::VERSION;"
3.48


Replies are listed 'Best First'.
Re: XML::Twig - Filtering and duplexing output to multiple output files by Loops (Curate) on Nov 12, 2014 at 00:19 UTC
As long as you don't mind using IO::Tee from CPAN to send output to two files at once, the code below works. In the default case you output to both files, and in the "thing" case you choose which file (or neither) receives the output.

use XML::Twig;
use IO::Tee;

open my $frufile, '>', 'fruit.xml' or die "fruit $!";
open my $vegfile, '>', 'veg.xml' or die "veg $!";
my $tee = IO::Tee->new($frufile, $vegfile);
select $tee;   # flush with no filehandle argument now goes to both files

my $twig=XML::Twig->new(
  twig_handlers => {
    thing     => \&magic,
    _default_ => sub { $_[0]->flush; 1; },
  },
  pretty_print => 'indented',
  empty_tags   => 'normal',
);
$twig->parse( *DATA );

sub magic {
  my ($thing, $element) = @_;   # $thing is the twig, $element is the <thing> element
  for ($element->{att}{type}) {
    if (/fruit/)        { $thing->flush($frufile); }
    elsif (/vegetable/) { $thing->flush($vegfile); }
    else                { $thing->purge; }
  }
  1;
}

__DATA__
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing type="fruit"     >Im an apple!</thing>
  <thing type="city"      >Toronto</thing>
  <thing type="vegetable" >Im a carrot!</thing>
  <thing type="city"      >Melrose</thing>
  <thing type="vegetable" >Im a potato!</thing>
  <thing type="fruit"     >Im a pear!</thing>
  <thing type="vegetable" >Im a pickle!</thing>
  <thing type="city"      >Patna</thing>
  <thing type="fruit"     >Im a banana!</thing>
  <thing type="vegetable" >Im an eggplant!</thing>
  <thing type="city"      >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>
Re^2: XML::Twig - Filtering and duplexing output to multiple output files by ateague (Monk) on Nov 12, 2014 at 22:06 UTC
Thank you very much Loops! `IO::Tee` was exactly what I was looking for.
Re: XML::Twig - Filtering and duplexing output to multiple output files by Discipulus (Canon) on Nov 12, 2014 at 08:06 UTC
Hello ateague,

You had the right answer from Loops. I want to add only a few words about your last question: "Is there a way to conditionally split target elements off into separate files and duplicate elements "outside" the target element to separate files without having to read the input file multiple times?"

Even without a good module to multiplex the output, you have no need to re-read the file. You can just build your own custom print sub:

sub print_dup {
  foreach my $fh ( @previously_opened_handle_for_write ) {
    print {$fh} @_
  }
}

You can also review similar questions here at PerlMonks, such as this one or my naive approach.

HtH
L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
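For anyone who wants to try the print_dup idea in isolation, here is a minimal sketch using only core Perl. The in-memory buffers standing in for fruit.xml and veg.xml, the @handles array, and the sample strings are assumptions made for illustration; they are not from the original program. Note that this only duplicates strings you already have in hand; it does not by itself make XML::Twig's flush write to two handles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory filehandles stand in for fruit.xml and veg.xml (illustration only).
my ($fruit_buf, $veg_buf) = ('', '');
open my $fruit_fh, '>', \$fruit_buf or die "fruit buffer: $!";
open my $veg_fh,   '>', \$veg_buf   or die "veg buffer: $!";

# Hypothetical list of every handle that should receive shared output.
my @handles = ($fruit_fh, $veg_fh);

# Print the same list of strings to every handle in @handles.
sub print_dup {
    my @text = @_;
    for my $fh (@handles) {
        print {$fh} @text or die "print failed: $!";
    }
    return 1;
}

print_dup("<header>shared</header>\n");       # goes to both buffers
print {$fruit_fh} "<thing>apple</thing>\n";   # fruit only
print {$veg_fh}   "<thing>carrot</thing>\n";  # vegetables only
print_dup("<trailer>shared</trailer>\n");     # goes to both buffers

close $_ for @handles;

print "fruit:\n$fruit_buf\nveg:\n$veg_buf";
```

Swapping the in-memory buffers for real files only changes the two open calls; the fan-out logic stays the same.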
Re^2: XML::Twig - Filtering and duplexing output to multiple output files by mirod (Canon) on Nov 12, 2014 at 14:14 UTC
The problem in this case is the `flush` method, which does some light magic to output the parts of the tree that haven't been output yet, and then updates the state of the XML::Twig object (so the next call to `flush` Just Works™). You can't call it twice in a row with 2 different filehandles; it wouldn't work, because the second call would find nothing left to output. That's why using `IO::Tee` is a brilliant idea.
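To make the "one write, several destinations" idea concrete, here is a minimal core-Perl sketch of a tied handle that forwards each print to several real handles. The package name My::MiniTee is hypothetical, and this is a teaching sketch rather than IO::Tee's actual implementation; it ignores `$,` and `$\` and has not been tested with XML::Twig. It only shows why a single print call, and so a single `flush`, is enough when the handle itself fans out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical package name; a teaching sketch, not IO::Tee's implementation.
package My::MiniTee;

sub TIEHANDLE {
    my ($class, @handles) = @_;
    return bless { handles => \@handles }, $class;
}

# Called for every print on the tied handle: forward to each real handle.
sub PRINT {
    my ($self, @text) = @_;
    my $ok = 1;
    for my $fh (@{ $self->{handles} }) {
        print {$fh} @text or $ok = 0;
    }
    return $ok;
}

# printf on the tied handle: format once, then fan out through PRINT.
sub PRINTF {
    my ($self, $format, @args) = @_;
    return $self->PRINT(sprintf $format, @args);
}

package main;

# In-memory buffers stand in for the two output files (illustration only).
my ($buf_a, $buf_b) = ('', '');
open my $fh_a, '>', \$buf_a or die "buffer a: $!";
open my $fh_b, '>', \$buf_b or die "buffer b: $!";

# One handle that fans out to both real handles.
tie *TEE, 'My::MiniTee', $fh_a, $fh_b;

print TEE "<header/>\n";                 # one write, two destinations
print {$fh_a} "<thing>apple</thing>\n";  # bypass the tee for a single target

close $fh_a;
close $fh_b;

print "a:\n$buf_a\nb:\n$buf_b";
```

The writer calls print once and the handle decides where the bytes go, so the writer's internal state is updated only once.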
Re^3: XML::Twig - Filtering and duplexing output to multiple output files by Discipulus (Canon) on Nov 13, 2014 at 08:47 UTC
Obviously the author is right! Indeed, see the docs about flush. You cannot hack it using print or sprint, because they have to be used AFTER the parse. Lesson learned. Thanks.

L*