zacc has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying (reasonably successfully, to a point) to process XML files using XML-Twig, but I'm encountering some undesirable output issues. The input file I receive has this format:
<EXPORT>
  <OUTPUT>2008-01-01</OUTPUT>
  <RECORD>.....</RECORD>
  <RECORD>.....</RECORD>
  ....
</EXPORT>
I'm looking to retrieve certain RECORD elements from the file - which I can isolate using handlers without any problem (the code below is just a sample).
sub record_check {
    my ( $nhf, $record ) = @_;
    my $member = $record->first_child('ID')->text;
    if ( $member < 1000 ) {
        $nhf->flush();
        print "\n";
    }
    $nhf->purge();
}

my $nhf = XML::Twig->new(
    twig_roots    => { RECORD => 1, },
    twig_handlers => { RECORD => \&record_check, },
);
$nhf->parsefile( $input_file );
$nhf->purge;
I was expecting to see
<RECORD>...........</RECORD>
<RECORD>...........</RECORD>
....
but instead, if the first RECORD matches my query, I get
<EXPORT><RECORD>............</RECORD>
<RECORD>...........</RECORD>
....
Everything is OK if the first RECORD doesn't match.

Does anyone have any ideas how I can get rid of the unwanted <EXPORT> at the beginning of the output? I had hoped using twig_roots would solve the problem, but apparently not.

BTW - The input file is HUGE, so use of handlers is a MUST, as otherwise building the entire tree takes an enormous amount of time and memory.

Re: XML Twig - Isolate Element
by GrandFather (Saint) on Jan 03, 2008 at 17:28 UTC

    The following seems to do what you want:

    use strict;
    use warnings;
    use XML::Twig;

    my $nhf = XML::Twig->new(
        twig_roots => { RECORD => \&record_check, },
    );

    $nhf->parse(<<'XML');
    <EXPORT>
      <OUTPUT>2008-01-01</OUTPUT>
      <RECORD><ID>1</ID></RECORD>
      <RECORD><ID>2</ID></RECORD>
    </EXPORT>
    XML

    sub record_check {
        my ( $nhf, $record ) = @_;
        my $member = $record->first_child('ID')->text();
        if ( $member < 1000 ) {
            $record->print();
            print "\n";
        }
        $nhf->purge();
    }

    Prints:

    <RECORD><ID>1</ID></RECORD>
    <RECORD><ID>2</ID></RECORD>

    Perl is environmentally friendly - it saves trees
      Hmmm, I'm using 3.26 here (should have mentioned that earlier).

      I've tried "Print" but that repeats <EXPORT> for every line output... which is even worse.

      I'll upgrade to the latest and try again.

        Did you actually try the sample code? The output shown was with 3.26.

        Note that the print is called on $record, not $nhf.


Re: XML Twig - Isolate Element
by mirod (Canon) on Jan 03, 2008 at 17:18 UTC

    I was in the middle of writing a long explanation of why this is normal, when it occurred to me to try your code, and apparently it works just fine in XML::Twig 3.32. Which is completely wrong (not to mention baffling!). So I think I need to investigate this one a bit.

    In any case the gist of the explanation was that XML::Twig works by storing trees. To keep a proper tree even when twig_roots is used, it always stores the root of the document, which flush (should!) duly print when asked to. If you just use print instead of flush, then you should get what you want (and the subsequent purge should get rid of the element anyway, so you shouldn't see any increase in memory usage).
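
    A minimal sketch of that suggestion, assuming the ID threshold from the question: serialize the matching element itself rather than asking the twig to flush. The helper name matching_records and the sample data are hypothetical, and sprint is used instead of print only so that the result can be returned as a list.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not from the thread): returns the serialized RECORD
# elements whose ID is below $limit, without the enclosing document root.
sub matching_records {
    my ( $xml, $limit ) = @_;
    my @found;

    my $twig = XML::Twig->new(
        twig_roots => {
            RECORD => sub {
                my ( $t, $record ) = @_;
                my $id = $record->first_child('ID')->text;

                # sprint serializes this element only, not its ancestors
                push @found, $record->sprint if $id < $limit;

                # drop what has been parsed so far to keep memory use low
                $t->purge;
            },
        },
    );
    $twig->parse($xml);
    return @found;
}

# Illustrative sample data, shaped like the input described in the question
my $xml = <<'XML';
<EXPORT>
  <OUTPUT>2008-01-01</OUTPUT>
  <RECORD><ID>1</ID></RECORD>
  <RECORD><ID>5000</ID></RECORD>
  <RECORD><ID>2</ID></RECORD>
</EXPORT>
XML

print "$_\n" for matching_records( $xml, 1000 );
```

    For a huge file, the same handler could be used with parsefile in place of parse, printing each element as it is found instead of collecting it.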

    Does that help? In any case it looks like you helped me find a regression in the code, so thanks a lot.

      OK, I've upgraded the module to 3.32 - and FLUSH works as I would have expected (but obviously not as you intended!).

      PRINT dumps out the header line for every row, which it also did under 3.- which is why I was using FLUSH.

        Sorry, I should have been more clear: you should print $record, not $nhf, this way you print exactly what you want, as GrandFather mentioned.

        Well, would you know it, it appears that this is not a bug. I added this in 3.30, with a proper test (but apparently not too much docs apart from a comment in the code): when twig_roots are used, then the root is considered flushed, so flush will not output it.
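
        As a rough sketch of what this means in practice, assuming XML::Twig 3.30 or later is installed: the flush-based handler from the original question can be kept, with a version requirement on the use line so that older installations fail loudly. The helper name flush_matching, the in-memory filehandle and the sample data are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;
use XML::Twig 3.30;    # compile-time failure if the installed module is older

# Hypothetical helper: flushes each RECORD whose ID is below $limit into an
# in-memory buffer and returns that buffer.
sub flush_matching {
    my ( $xml, $limit ) = @_;
    my $buffer = '';
    open my $out, '>', \$buffer or die "cannot open in-memory handle: $!";

    my $twig = XML::Twig->new(
        twig_roots => {
            RECORD => sub {
                my ( $t, $record ) = @_;
                if ( $record->first_child('ID')->text < $limit ) {
                    # per the thread, the root counts as already flushed here
                    $t->flush($out);
                    print {$out} "\n";
                }
                $t->purge;    # discard anything that was not flushed
            },
        },
    );
    $twig->parse($xml);
    $twig->purge;
    close $out;
    return $buffer;
}

# Illustrative sample data, shaped like the input described in the question
my $xml = <<'XML';
<EXPORT>
  <OUTPUT>2008-01-01</OUTPUT>
  <RECORD><ID>1</ID></RECORD>
  <RECORD><ID>5000</ID></RECORD>
  <RECORD><ID>2</ID></RECORD>
</EXPORT>
XML

print flush_matching( $xml, 1000 );
```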

        So your original code was Ok after all, and my answer should have been an easy "I know, upgrade the module, it's fixed in the current version". ;--)