PerlMonks  

Buggy output from XML::Twig on a Tee

by seki (Monk)
on Feb 25, 2016 at 10:33 UTC ( [id://1156113]=perlquestion )

seki has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am trying to split an XML file into multiple well-formed fragments. An old solution given here, "Re: XML::Twig - Filtering and duplexing output to multiple output files", does pretty much what I am looking for, with the help of XML::Twig writing into a Tee... at least with simple input data.

If I complicate the data structure a little by grouping the nodes to filter under a parent node, the second file is not well formed: the parent node is missing its opening tag. And I am at a loss to find the cause.

SSCCE (the difference from the initial example is the <thing_list> that contains the <thing>s):
use XML::Twig;
use IO::Tee;
use feature 'say';

open my $frufile, '>', 'fruit.xml' or die "fruit $!";
open my $vegfile, '>', 'veg.xml' or die "veg $!";

my $tee = IO::Tee->new($frufile, $vegfile);
select $tee;

my $twig=XML::Twig->new(
    twig_handlers => {
        thing => \&magic,
        _default_ => sub {
            say STDOUT '_default_ for '.$_->name;
            $_[0]->flush($tee); #default filehandle = tee
            1;
        },
    },
    pretty_print => 'indented',
    empty_tags => 'normal',
);

$twig->parse( *DATA );

sub magic {
    my ($thing, $element) = @_;
    say STDOUT "magic for ". $element->{att}{type};
    for ($element->{att}{type}) {
        if (/fruit/) {
            $thing->flush($frufile);
        } elsif (/vegetable/) {
            $thing->flush($vegfile);
        } else {
            $thing->purge;
        }
    }
    1;
}

__DATA__
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing_list>
    <thing type="fruit" >Im an apple!</thing>
    <thing type="city" >Toronto</thing>
    <thing type="vegetable" >Im a carrot!</thing>
    <thing type="city" >Melrose</thing>
    <thing type="vegetable" >Im a potato!</thing>
    <thing type="fruit" >Im a pear!</thing>
    <thing type="vegetable" >Im a pickle!</thing>
    <thing type="city" >Patna</thing>
    <thing type="fruit" >Im a banana!</thing>
    <thing type="vegetable" >Im an eggplant!</thing>
    <thing type="city" >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>
While the first "fruit.xml" is ok:
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
  <thing_list>
    <thing type="fruit">Im an apple!</thing>
    <thing type="fruit">Im a pear!</thing>
    <thing type="fruit">Im a banana!</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>
the "veg.xml" is missing an opening tag for <thing_list>
<batch>
  <header>
    <foo>1</foo>
    <bar>2</bar>
    <baz>3</baz>
  </header>
    <thing type="vegetable">Im a carrot!</thing>
    <thing type="vegetable">Im a potato!</thing>
    <thing type="vegetable">Im a pickle!</thing>
    <thing type="vegetable">Im an eggplant!</thing>
  </thing_list>
  <trailer>
    <chrzaszcz>A</chrzaszcz>
    <zdzblo>B</zdzblo>
  </trailer>
</batch>

I have also noticed that if I comment out the <thing_list> tags in the data, the comment corresponding to the opening tag is also missing from veg.xml, but not from fruit.xml...

FWIW, I am using Strawberry Perl 5.20.1 on a Windows 7 box.

Update: in the case of the comments, my understanding is that the first comment is output while processing the first <thing>, and the second is output by the _default_ handler while processing the rest of the file. But I do not understand whether the same thing happens when <thing_list> is not commented out.

Replies are listed 'Best First'.
Re: Buggy output from XML::Twig on a Tee
by toolic (Bishop) on Feb 25, 2016 at 14:54 UTC

      Yes, I am a newcomer here, while being a regular StackO citizen. Also PerlMonks seemed very calm this morning...

      I am getting some results from StackO and will post any helpful result here.

      The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian
Re: Buggy output from XML::Twig on a Tee
by ateague (Monk) on Feb 25, 2016 at 14:54 UTC

    Quick question: what is the desired output?

    With <thing_list> or without?

      While dispatching the <thing> elements between the two files, I would expect each file to be well formed, i.e. to have both opening and closing tags for <thing_list>, with perhaps no child elements inside.

        Pretty much what Ikegami said on SO. You can see the order in which the twigs and residual tags are processed by modifying the code as follows:

        use XML::Twig;
        use IO::Tee;
        use feature 'say';

        open my $frufile, '>', 'fruit.xml' or die "fruit $!";
        open my $vegfile, '>', 'veg.xml' or die "veg $!";

        my $tee = IO::Tee->new($frufile, $vegfile);
        select $tee;

        my $twig=XML::Twig->new(
            twig_handlers => {
                thing => \&magic,
                _default_ => sub {
                    print '_default_ for '.$_->name." [[[";
                    $_[0]->flush($tee); #default filehandle = tee
                    say "]]]";
                    1;
                },
            },
            pretty_print => 'none',
            empty_tags => 'normal',
        );

        $twig->parse( *DATA );

        sub magic {
            my ($thing, $element) = @_;
            print "magic for ". $element->{att}{type}." [[[";
            for ($element->{att}{type}) {
                if (/fruit/) {
                    $thing->flush($frufile);
                } elsif (/vegetable/) {
                    $thing->flush($vegfile);
                } else {
                    $thing->purge;
                }
            }
            say "]]]";
            1;
        }

        __DATA__
        <batch>
          <header>
            <foo>1</foo>
            <bar>2</bar>
            <baz>3</baz>
          </header>
          <thing_list>
            <thing type="fruit" >Im an apple!</thing>
            <thing type="city" >Toronto</thing>
            <thing type="vegetable" >Im a carrot!</thing>
            <thing type="city" >Melrose</thing>
            <thing type="vegetable" >Im a potato!</thing>
            <thing type="fruit" >Im a pear!</thing>
            <thing type="vegetable" >Im a pickle!</thing>
            <thing type="city" >Patna</thing>
            <thing type="fruit" >Im a banana!</thing>
            <thing type="vegetable" >Im an eggplant!</thing>
            <thing type="city" >Taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</thing>
          </thing_list>
          <trailer>
            <chrzaszcz>A</chrzaszcz>
            <zdzblo>B</zdzblo>
          </trailer>
        </batch>

        Fruit.xml:

        _default_ for foo[[[<batch><header><foo>1</foo>]]]
        _default_ for bar[[[<bar>2</bar>]]]
        _default_ for baz[[[<baz>3</baz>]]]
        _default_ for header[[[</header>]]]
        magic for fruit [[[<thing_list><thing type="fruit">Im an apple!</thing>]]]
        magic for city [[[]]]
        magic for vegetable [[[]]]
        magic for city [[[]]]
        magic for vegetable [[[]]]
        magic for fruit [[[<thing type="fruit">Im a pear!</thing>]]]
        magic for vegetable [[[]]]
        magic for city [[[]]]
        magic for fruit [[[<thing type="fruit">Im a banana!</thing>]]]
        magic for vegetable [[[]]]
        magic for city [[[]]]
        _default_ for thing_list[[[</thing_list>]]]
        _default_ for chrzaszcz[[[<trailer><chrzaszcz>A</chrzaszcz>]]]
        _default_ for zdzblo[[[<zdzblo>B</zdzblo>]]]
        _default_ for trailer[[[</trailer>]]]
        _default_ for batch[[[</batch>]]]

        Veg.xml

        _default_ for foo[[[<batch><header><foo>1</foo>]]]
        _default_ for bar[[[<bar>2</bar>]]]
        _default_ for baz[[[<baz>3</baz>]]]
        _default_ for header[[[</header>]]]
        magic for fruit [[[]]]
        magic for city [[[]]]
        magic for vegetable [[[<thing type="vegetable">Im a carrot!</thing>]]]
        magic for city [[[]]]
        magic for vegetable [[[<thing type="vegetable">Im a potato!</thing>]]]
        magic for fruit [[[]]]
        magic for vegetable [[[<thing type="vegetable">Im a pickle!</thing>]]]
        magic for city [[[]]]
        magic for fruit [[[]]]
        magic for vegetable [[[<thing type="vegetable">Im an eggplant!</thing>]]]
        magic for city [[[]]]
        _default_ for thing_list[[[</thing_list>]]]
        _default_ for chrzaszcz[[[<trailer><chrzaszcz>A</chrzaszcz>]]]
        _default_ for zdzblo[[[<zdzblo>B</zdzblo>]]]
        _default_ for trailer[[[</trailer>]]]
        _default_ for batch[[[</batch>]]]

        Notice how the <thing> handler is called only once the element's closing tag has been parsed, and how its flush also emits any earlier tag information that has not been output yet. In this run the pending <thing_list> opening tag is therefore written along with the first <thing> (a fruit), so it reaches fruit.xml only and never veg.xml.

Re: Buggy output from XML::Twig on a Tee
by mr_ron (Chaplain) on Feb 26, 2016 at 16:25 UTC

    I don't use stackoverflow much and couldn't comment there, which might have been more appropriate, but I took Ikegami's solution to your stackoverflow post and modified it just a little as below and the result seemed to do what you wanted with one pass.

    my $tee = IO::Tee->new($frufile, $vegfile);
    my $twig = XML::Twig->new(
        ... etc.
        twig_print_outside_roots => $tee,
        ...
    );
    $twig->parse( *DATA );
    Ron
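
    For readers who want to try the idea end to end, here is a minimal sketch of the twig_roots approach that the snippet above abbreviates. It is not Ikegami's or Ron's exact code: the handler body, the shortened sample data and the use of a here-document instead of __DATA__ are assumptions made for illustration. Only <thing> elements are turned into twigs, and everything else is copied to both files through the tee.

```perl
use strict;
use warnings;
use XML::Twig;
use IO::Tee;

# Shortened copy of the sample data from the question (assumption: a
# here-document replaces the original __DATA__ section).
my $xml = <<'XML';
<batch>
  <header><foo>1</foo></header>
  <thing_list>
    <thing type="fruit">Im an apple!</thing>
    <thing type="city">Toronto</thing>
    <thing type="vegetable">Im a carrot!</thing>
  </thing_list>
  <trailer><zdzblo>B</zdzblo></trailer>
</batch>
XML

open my $frufile, '>', 'fruit.xml' or die "fruit $!";
open my $vegfile, '>', 'veg.xml'   or die "veg $!";
my $tee = IO::Tee->new($frufile, $vegfile);

my $twig = XML::Twig->new(
    # Only <thing> elements are built into twigs and passed to the handler.
    twig_roots => {
        thing => sub {
            my ($t, $elt) = @_;
            my $type = $elt->att('type') // '';
            if    ($type eq 'fruit')     { $elt->print($frufile) }
            elsif ($type eq 'vegetable') { $elt->print($vegfile) }
            # Any other type (e.g. city) is not printed at all.
            $t->purge;    # release the memory held by this element
            1;
        },
    },
    # Everything outside the roots (header, <thing_list> tags, trailer)
    # is copied to both files through the tee.
    twig_print_outside_roots => $tee,
);
$twig->parse($xml);

close $frufile or die "fruit $!";
close $vegfile or die "veg $!";
```

    Because the <thing_list> tags never pass through a flush tied to one particular file, both outputs should receive the opening and the closing tag. The whitespace in the results may differ from the pretty-printed output shown in the question.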

      Yes, Ikegami's answer (using twig_roots instead of twig_handlers) produces the requested result, in a single pass.

      But it is not the solution that I will use, because of some additional needs :p : in the 'common' part of the produced files, an identifier needs to be different in each file. So when I parse that identifier with XML::Twig, I cannot know which value to put in, because I do not know the final number of files.

      The result is a complicated program where I need to keep the beginning of the file in memory so that it can be updated differently each time I write a chunk of <thing> elements to a separate file, while not keeping all the <thing>s in memory because of the quantity of data... I am tweaking an XML::SAX::Writer based solution instead.
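
      The requirement described above (header kept in memory, a different identifier per output file, <thing>s not accumulated) can be sketched roughly as follows. This is only an illustration with XML::Twig, not the XML::SAX::Writer solution mentioned: the <id> element, the chunk size, the chunk_N.xml file names and the sample data are all hypothetical, and the trailer is left out for brevity.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample: <id> stands for the per-file identifier.
my $xml = <<'XML';
<batch>
  <header><id>0</id><bar>2</bar></header>
  <thing_list>
    <thing type="fruit">apple</thing>
    <thing type="vegetable">carrot</thing>
    <thing type="fruit">pear</thing>
  </thing_list>
</batch>
XML

my $chunk_size = 2;    # hypothetical: number of <thing>s per output file
my $header;            # detached copy of <header>, kept in memory
my @pending;           # serialized <thing>s waiting to be written
my $file_no = 0;

# Write the pending <thing>s to the next file, with its own identifier.
sub write_chunk {
    return unless @pending;
    die "no <header> seen before the first <thing>" unless $header;
    $file_no++;
    $header->first_child('id')->set_text($file_no);
    my $name = "chunk_$file_no.xml";
    open my $out, '>', $name or die "$name: $!";
    print {$out} '<batch>', $header->sprint,
                 '<thing_list>', @pending, '</thing_list></batch>', "\n";
    close $out or die "$name: $!";
    @pending = ();
}

my $twig = XML::Twig->new(
    twig_handlers => {
        header => sub {
            my ($t, $elt) = @_;
            $header = $elt->copy;    # the copy survives the purge below
            $t->purge;
            1;
        },
        thing => sub {
            my ($t, $elt) = @_;
            push @pending, $elt->sprint;
            $t->purge;               # at most one chunk stays in memory
            write_chunk() if @pending >= $chunk_size;
            1;
        },
    },
);
$twig->parse($xml);
write_chunk();    # write the last, possibly partial, chunk
```

      With this layout only the header copy and at most one chunk of serialized <thing>s are held at any time, so the number of output files does not need to be known in advance.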
