Is it possible to parse an XML file recursively using XML::Twig?

Ppeoc has asked for the wisdom of the Perl Monks concerning the following question:

How can I parse a huge, highly nested XML document to display each leaf element? I want to do this recursively, passing each child back into the recursive function until a leaf is found. Is this possible using XML::Twig?

Here is a sample of the XML I want to parse:

<?xml version="1.0" encoding="UTF-8"?>
<ArrayOfBooks>
<Book>
<Title>The book of books</Title>
<Author>Sally</Author>
<Released>1/2/2008</Released>
</Book>
<Book>
<Title>The page of pages</Title>
<Author>Amanda</Author>
<Released>6/3/1998</Released>
</Book>
<Book>
<Title>The book of pages</Title>
<Author>John</Author>
<Released>6/22/1963</Released>
</Book>
<Book>
<Title>The rock of ages</Title>
<Author>Frank</Author>
<Released>5/21/2004</Released>
</Book>
<Book>
<Title>The age of rocks</Title>
<Author>Mary</Author>
<Released>8/16/1944</Released>
</Book>
</ArrayOfBooks>

Say I want to parse all the children in the code snippet below. I want to recursively parse each node until the last child is reached. I have no information about the nodes, as they are dynamic, and my actual file is highly nested. How can I do this using Twig?

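use strict;
use warnings;
use XML::Twig;

my $depth = 0;

my $twig = XML::Twig->new(
    twig_handlers => { '/ArrayOfBooks/Book' => \&sect },
);
my $input = 'Books.XML';
$twig->parsefile($input);

sub sect {
    my ( $twig, $ele ) = @_;
    $depth++;

    # if the node does not contain child elements, it is a leaf
    my @children = $ele->children('#ELT');
    if ( !@children ) {
        end_element($ele);
        $depth--;
        return;
    }

    # node contains children: recurse into each of them
    sect( $twig, $_ ) for @children;
    $depth--;
}

sub end_element {
    # I need both key and value, e.g. "Title: The page of pages"
    my ($leaf) = @_;
    print $leaf->tag, ': ', $leaf->text, "\n";
}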

Replies are listed 'Best First'.
Re: Parsing a huge XML document recursively by choroba (Cardinal) on Oct 19, 2015 at 17:03 UTC
I tried with XML::LibXML::Reader instead. It's just an example (it doesn't handle processing instructions, comments, entities etc.), but it could get you started.

    #!/usr/bin/perl
    use warnings;
    use strict;

    use XML::LibXML::Reader;

    my $reader = 'XML::LibXML::Reader'->new(location => shift) or die;

    my $node;
    while ($reader->read) {
        if ($reader->nodeType == XML_READER_TYPE_ELEMENT) {
            $node = $reader->copyCurrentNode(0);

        } elsif ($reader->nodeType == XML_READER_TYPE_TEXT) {
            $node->appendText($reader->copyCurrentNode(0));

        } elsif ($reader->nodeType == XML_READER_TYPE_END_ELEMENT) {
            print $node, "\n";
        }
    }

`copyCurrentNode(0)` creates a "shallow" copy, i.e. it doesn't go into the subtree. It keeps the attributes, though.

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Parsing a huge XML document recursively by Jenda (Abbot) on Oct 19, 2015 at 20:08 UTC
Yes, though you'll most probably want to process a whole twig, not just the individual leaves. Without more details about the XML it's impossible to be more specific.

Another contender would be XML::Rules. With the right handlers it can process huge files easily.

Update: I do not use XML::Twig myself. This is an example of what you could do with XML::Rules:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules       => {
            _default => 'content',
            Book     => sub {
                my ( $tag, $attr ) = @_;
                if ( !$attr->{Title} ) { return; }    # no title, OK. ignore
                print "$attr->{Title} by $attr->{Author} was released $attr->{Released}\n";
                return;
            },
        },
    );

    $parser->parse(\*DATA);
    # or
    # $parser->parsefile('path/to/the/file');

Jenda
Enoch was right! Enjoy the last years of Rome.
Re^2: Parsing a huge XML document recursively by Ppeoc (Beadle) on Oct 20, 2015 at 04:06 UTC
I have edited the question adding more details. Could you please help me?
Re: Is it possible to parse an XML file recursively using XML::Twig? by mr_ron (Deacon) on Oct 20, 2015 at 15:37 UTC
I like XPath and found some XPath solutions to the problem on stackoverflow here, including this one: `//*[not(child::*)]`. XML::Twig didn't seem to handle the XPath expressions from stackoverflow, but XML::XPath did. XML::XPath also comes with an 'xpath' grep-like utility that might give you a solution as simple as:

    xpath -e '//*[not(child::*)]' books.xml

If you want to do Perl coding for further processing, the solution is still pretty simple:

    use strict;
    use warnings;

    use XML::XPath;
    use XML::XPath::XMLParser;

    my $xp = XML::XPath->new( filename => 'books.xml' );

    # select every element that has no child elements
    my $nodeset = $xp->find('//*[not(child::*)]');

    foreach my $node ( $nodeset->get_nodelist ) {
        print "FOUND:", XML::XPath::XMLParser::as_string($node), "\n";
    }

Ron
Re^2: Is it possible to parse an XML file recursively using XML::Twig? by Jenda (Abbot) on Oct 21, 2015 at 16:27 UTC
Please correct me if I'm wrong, but ... this will parse the whole (huge) file and produce an even huger maze of objects in memory before you even get a chance to call your find(). A huge waste of memory for such a task.

Jenda
Enoch was right! Enjoy the last years of Rome.
Re^2: Is it possible to parse an XML file recursively using XML::Twig? by Ppeoc (Beadle) on Oct 30, 2015 at 17:57 UTC
Ron, I really like your solution! I was using XML::parse so far. But this one definitely seems to work. Thank you
Re: Is it possible to parse an XML file recursively using XML::Twig? by Preceptor (Deacon) on Oct 23, 2015 at 10:25 UTC
"XML::Twig" is implictly recursive. You don't really need to do any more recursion. But I also don't think you're doing anything anywhere near as complicated as that: `#!/usr/bin/env perl use strict; use warnings; use XML::Twig; use Data::Dumper; sub print_book { my ( $twig, $book ) = @_; my %this_book = map { $_ -> tag, $_ -> text } $book -> children; print Dumper \%this_book; $twig -> purge; } my $twig = XML::Twig -> new ( 'twig_handlers' => { 'Book' => \&print_book } ); $twig -> parsefile ( 'your_xml_file' );` [download] This runs through your your XML; and turns each `Book` into a hash. (And then Dumps it, but you can do something more useful). Or have I missed something profound about what you're trying to accomplish? Output from the above looks a bit like: `$VAR1 = { 'Title' => 'The age of rocks', 'Author' => 'Mary', 'Released' => '8/16/1944' };` [download] Alternatively, you can simply use the "_all_" handler, and test each node for having children: `sub handle_node { my ( $twig, $element ) = @_; unless ( $element -> has_children ) { print "(", $element -> parent -> tag, ") ", $element -> tag, ": ", $element -> text,"\n"; } $twig -> purge; } my $twig = XML::Twig -> new ( 'twig_handlers' => { '_all_' => \&handle_node } ); $twig -> parsefile ( 'yourfile');` [download] This will traverse all the nodes, printing any that don't have children, and purging to free up memory. With your sample data, this prints: `(Book) Title: The book of books (Book) Author: Sally (Book) Released: 1/2/2008 (Book) Title: The page of pages (Book) Author: Amanda (Book) Released: 6/3/1998 (Book) Title: The book of pages (Book) Author: John (Book) Released: 6/22/1963 (Book) Title: The rock of ages (Book) Author: Frank (Book) Released: 5/21/2004 (Book) Title: The age of rocks (Book) Author: Mary (Book) Released: 8/16/1944` [download]	[reply] [d/l] [select]
Re^2: Is it possible to parse an XML file recursively using XML::Twig? by mr_ron (Deacon) on Oct 24, 2015 at 16:33 UTC
I had trouble with your second solution, the one that involved "the '_all_' handler, and test each node for having children". When I ran it, as written, I got output like:

    (ArrayOfBooks) Book:
    (ArrayOfBooks) Book:
    (Book) Released:
    (ArrayOfBooks) Book:
    (ArrayOfBooks) Book:
    (ArrayOfBooks) Book:
    Can't call method "tag" on an undefined value at monk_twig_xml_leaf2.pl line 11.
     at monk_twig_xml_leaf2.pl line 19.
     at monk_twig_xml_leaf2.pl line 19.

I tried commenting out the "purge" call and got empty output with no errors, seemingly because `$element->has_children` was returning true for "#PCDATA" text nodes. I am new to XML::Twig, but not so new to XML, and am starting to appreciate XML::Twig's potential for optimization. I did come up with some working code as well, but would first be interested in what I might be doing wrong that keeps Preceptor's example from running.

Ron
Re^3: Is it possible to parse an XML file recursively using XML::Twig? by Preceptor (Deacon) on Oct 24, 2015 at 23:23 UTC
Calling tag on an undefined value is probably the parent call. Adding a "defined" test there will probably do the trick. But I would suggest that the strength of the module is in using XPath, so you rarely need to do a traverse in the first place.
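One way to act on that suggestion, as a minimal sketch and not a tested fix for the script above: keep the `_all_` handler, guard the `parent` call with `defined` (the root element has no parent), and stop asking the tree whether an element has children. As far as I can tell, `has_children` also counts #PCDATA text nodes, and `purge` removes already-parsed children before their ancestors are handled, so both tests mislead. Here each element flags its still-open parent when it closes. The names `%has_element_child` and `@leaves` and the inline two-book sample are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;
use Scalar::Util qw(refaddr);

my %has_element_child;    # refaddr of a still-open element => 1
my @leaves;               # collected [ parent tag, tag, text ] triples

sub handle_node {
    my ( $twig, $element ) = @_;

    # No child element reported to this one, so treat it as a leaf.
    my $is_leaf = !delete $has_element_child{ refaddr($element) };

    # Tell the (still open) parent that it has an element child.
    # The root has no parent, hence the "defined" guard.
    my $parent = $element->parent;
    $has_element_child{ refaddr($parent) } = 1 if defined $parent;

    if ($is_leaf) {
        my $parent_tag = defined $parent ? $parent->tag : '';
        push @leaves, [ $parent_tag, $element->tag, $element->text ];
    }

    # Free what has been parsed; nothing above looks at purged nodes.
    $twig->purge;
}

# Hypothetical sample input, shortened from the question's data.
my $xml = <<'XML';
<ArrayOfBooks>
  <Book><Title>The book of books</Title><Author>Sally</Author></Book>
  <Book><Title>The page of pages</Title><Author>Amanda</Author></Book>
</ArrayOfBooks>
XML

XML::Twig->new( twig_handlers => { '_all_' => \&handle_node } )->parse($xml);

printf "(%s) %s: %s\n", @$_ for @leaves;
```

For a real file, swap `parse($xml)` for `parsefile('yourfile')`. The bookkeeping hash only ever holds entries for elements that are still open, so it stays as small as the nesting depth.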
Re^4: Is it possible to parse an XML file recursively using XML::Twig? by mr_ron (Deacon) on Oct 26, 2015 at 14:56 UTC
Re^5: Is it possible to parse an XML file recursively using XML::Twig? by Ppeoc (Beadle) on Oct 30, 2015 at 18:40 UTC