Ppeoc has asked for the wisdom of the Perl Monks concerning the following question:

How can I parse a huge, highly nested XML document to display each leaf element? I want to do this recursively, passing each child into the recursive function until a leaf is found. Is this possible using XML::Twig?
Here is a sample of the XML I want to parse:
<?xml version="1.0" encoding="UTF-8"?>
<ArrayOfBooks>
  <Book>
    <Title>The book of books</Title>
    <Author>Sally</Author>
    <Released>1/2/2008</Released>
  </Book>
  <Book>
    <Title>The page of pages</Title>
    <Author>Amanda</Author>
    <Released>6/3/1998</Released>
  </Book>
  <Book>
    <Title>The book of pages</Title>
    <Author>John</Author>
    <Released>6/22/1963</Released>
  </Book>
  <Book>
    <Title>The rock of ages</Title>
    <Author>Frank</Author>
    <Released>5/21/2004</Released>
  </Book>
  <Book>
    <Title>The age of rocks</Title>
    <Author>Mary</Author>
    <Released>8/16/1944</Released>
  </Book>
</ArrayOfBooks>
Say I want to parse all the children in the code snippet below. I want to recursively parse each node until the last child is reached. I have no information about the nodes, as they are dynamic. My actual file is highly nested. How can I do this using Twig?
my $twig = XML::Twig->new( twig_handlers => { '/ArrayOfBooks/Book' => \&sect } );
my $input = 'Books.XML';
$twig->parsefile($input);

sub sect {
    my ($twig, $ele) = @_;
    $depth++;
    # if node does not contain children
    {
        end_element($ele);
        $depth--;
        return;
    }
    # node contains children
    sect($ele);
}

sub end_element {
    # I need both key and value. eg Title: The page of pages
    my ($leaf) = @_;
    print $key;
}

Replies are listed 'Best First'.
Re: Parsing a huge XML document recursively
by choroba (Cardinal) on Oct 19, 2015 at 17:03 UTC
    I tried with XML::LibXML::Reader instead. It's just an example (it doesn't handle processing instructions, comments, entities etc.), but it could get you started.
    #!/usr/bin/perl
    use warnings;
    use strict;

    use XML::LibXML::Reader;

    my $reader = 'XML::LibXML::Reader'->new(location => shift) or die;
    my $node;
    while ($reader->read) {
        if ($reader->nodeType == XML_READER_TYPE_ELEMENT) {
            $node = $reader->copyCurrentNode(0);
        } elsif ($reader->nodeType == XML_READER_TYPE_TEXT) {
            $node->appendText($reader->copyCurrentNode(0));
        } elsif ($reader->nodeType == XML_READER_TYPE_END_ELEMENT) {
            print $node, "\n";
        }
    }

    copyCurrentNode(0) creates a "shallow" copy, i.e. it doesn't go into the subtree. It keeps the attributes, though.
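
    If you need "key: value" pairs for the leaves only, the same reader loop can keep a small stack of the elements that are currently open. This is only a rough sketch: reader_leaves() is a made-up helper name, it ignores attributes, and it treats any element without element children as a leaf.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# reader_leaves() is a hypothetical helper; it returns "name: text" strings
# for every element that has no element children.
sub reader_leaves {
    my ($reader) = @_;
    my ( @open, @leaves );    # @open holds one record per currently open element

    while ( $reader->read ) {
        my $type = $reader->nodeType;
        if ( $type == XML_READER_TYPE_ELEMENT ) {
            $open[-1]{has_child} = 1 if @open;    # the parent is not a leaf
            if ( $reader->isEmptyElement ) {
                # <tag/> produces no END_ELEMENT event, so report it here.
                push @leaves, $reader->name . ': ';
            }
            else {
                push @open, { name => $reader->name, text => '', has_child => 0 };
            }
        }
        elsif ( $type == XML_READER_TYPE_TEXT || $type == XML_READER_TYPE_CDATA ) {
            # Collect the text of the innermost open element.
            $open[-1]{text} .= $reader->value if @open;
        }
        elsif ( $type == XML_READER_TYPE_END_ELEMENT ) {
            my $done = pop @open;
            push @leaves, "$done->{name}: $done->{text}" unless $done->{has_child};
        }
    }
    return @leaves;
}

my $xml = '<ArrayOfBooks><Book><Title>The book of books</Title>'
        . '<Author>Sally</Author></Book></ArrayOfBooks>';
my $reader = XML::LibXML::Reader->new( string => $xml ) or die "cannot parse\n";
print "$_\n" for reader_leaves($reader);
```

    For a real file, build the reader with location => $filename as in the script above. Only the stack of open elements is kept in memory, so the nesting depth matters, not the file size.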

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Parsing a huge XML document recursively
by Jenda (Abbot) on Oct 19, 2015 at 20:08 UTC

    Yes, though you'll most probably want to process a whole twig, not just the individual leaves. Without more details about the XML it's impossible to be more specific.

    Another contender would be XML::Rules. With the right handlers it can process huge files easily.

    Update: I do not use XML::Twig myself. This is an example of what you could do with XML::Rules:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => 'content',
            Book => sub {
                my ($tag, $attr) = @_;
                if (!$attr->{Title}) { return; } # no title, OK. ignore
                print "$attr->{Title} by $attr->{Author} was released $attr->{Released}\n";
                return;
            },
        },
    );
    $parser->parse(\*DATA);
    # or $parser->parsefile('path/to/the/file');

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I have edited the question adding more details. Could you please help me?
Re: Is it possible to parse an XML file recursively using XML::Twig?
by mr_ron (Deacon) on Oct 20, 2015 at 15:37 UTC

    I like XPath and found some XPath solutions to the problem on Stack Overflow, including this one: //*[not(child::*)]. XML::Twig didn't seem to handle the XPath expressions from Stack Overflow, but XML::XPath did. XML::XPath also comes with an 'xpath' grep-like utility that might give you a solution as simple as:

    xpath -e '//*[not(child::*)]' books.xml

    If you want to do Perl coding for further processing, the solution is still pretty simple:

    use strict;
    use warnings;
    use XML::XPath;
    use XML::XPath::XMLParser;

    my $xp = XML::XPath->new(filename => 'books.xml');
    my $nodeset = $xp->find('//*[not(child::*)]');
    foreach my $node ($nodeset->get_nodelist) {
        print "FOUND:", XML::XPath::XMLParser::as_string($node), "\n";
    }
    Ron

      Please correct me if I'm wrong, but this will parse the whole (huge) file and produce an even huger maze of objects in memory before you even get a chance to call your find(). A huge waste of memory for such a task.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

      Ron, I really like your solution! I was using XML::parse so far. But this one definitely seems to work. Thank you
Re: Is it possible to parse an XML file recursively using XML::Twig?
by Preceptor (Deacon) on Oct 23, 2015 at 10:25 UTC

    "XML::Twig" is implictly recursive. You don't really need to do any more recursion. But I also don't think you're doing anything anywhere near as complicated as that:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dumper;

    sub print_book {
        my ( $twig, $book ) = @_;
        my %this_book = map { $_ -> tag, $_ -> text } $book -> children;
        print Dumper \%this_book;
        $twig -> purge;
    }

    my $twig = XML::Twig -> new ( 'twig_handlers' => { 'Book' => \&print_book } );
    $twig -> parsefile ( 'your_xml_file' );

    This runs through your XML and turns each `Book` into a hash. (And then dumps it, but you can do something more useful.) Or have I missed something profound about what you're trying to accomplish? Output from the above looks a bit like:

    $VAR1 = {
              'Title' => 'The age of rocks',
              'Author' => 'Mary',
              'Released' => '8/16/1944'
            };

    Alternatively, you can simply use the "_all_" handler, and test each node for having children:

    sub handle_node {
        my ( $twig, $element ) = @_;
        unless ( $element -> has_children ) {
            print "(", $element -> parent -> tag, ") ", $element -> tag, ": ", $element -> text, "\n";
        }
        $twig -> purge;
    }

    my $twig = XML::Twig -> new ( 'twig_handlers' => { '_all_' => \&handle_node } );
    $twig -> parsefile ( 'yourfile');

    This will traverse all the nodes, printing any that don't have children, and purging to free up memory. With your sample data, this prints:

    (Book) Title: The book of books
    (Book) Author: Sally
    (Book) Released: 1/2/2008
    (Book) Title: The page of pages
    (Book) Author: Amanda
    (Book) Released: 6/3/1998
    (Book) Title: The book of pages
    (Book) Author: John
    (Book) Released: 6/22/1963
    (Book) Title: The rock of ages
    (Book) Author: Frank
    (Book) Released: 5/21/2004
    (Book) Title: The age of rocks
    (Book) Author: Mary
    (Book) Released: 8/16/1944

      I had trouble with your second solution that involved " the '_all_' handler, and test each node for having children". When I ran it, as written, I got output like:

      (ArrayOfBooks) Book:
      (ArrayOfBooks) Book:
      (Book) Released:
      (ArrayOfBooks) Book:
      (ArrayOfBooks) Book:
      (ArrayOfBooks) Book:
      Can't call method "tag" on an undefined value at monk_twig_xml_leaf2.pl line 11.
       at monk_twig_xml_leaf2.pl line 19.
       at monk_twig_xml_leaf2.pl line 19.

      I tried commenting out the "purge" call and got empty output with no errors, seemingly because $element->has_children was returning true for "#PCDATA" text nodes. I am new to XML::Twig, but not so new to XML, and am starting to appreciate XML::Twig's potential for optimization. I did come up with some working code as well, but would first be interested in what I might be doing wrong that keeps Preceptor's example from running.

      Ron

        Calling tag on undefined value is probably the parent call. Adding a "defined" test there will probably do the trick. But I will suggest that the strength of the module is in using xpath so you rarely need to do a traverse in the first place.
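
        A minimal sketch of that guard, with two assumptions of my own layered on top. It uses the '#ELT' condition so that a #PCDATA child does not count as a child (the has_children problem reported above), and instead of purge it deletes the children of each finished non-leaf element, so a parent never looks childless by the time its own handler fires. leaf_lines() and the '(root)' placeholder are made-up names for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# leaf_lines() is a hypothetical helper; it returns "(parent) tag: text"
# strings for every element that has no element children.
sub leaf_lines {
    my ($xml) = @_;
    my @lines;

    my $twig = XML::Twig->new(
        twig_handlers => {
            _all_ => sub {
                my ( $t, $element ) = @_;

                if ( $element->first_child('#ELT') ) {
                    # Not a leaf: its children were already handled, so
                    # drop them to release memory. The emptied element
                    # stays in the tree, so its own parent still sees
                    # an element child.
                    $_->delete for $element->children;
                    return;
                }

                # The root element has no parent, so guard the call.
                my $parent     = $element->parent;
                my $parent_tag = defined $parent ? $parent->tag : '(root)';
                push @lines,
                    "($parent_tag) " . $element->tag . ': ' . $element->text;
            },
        },
    );
    $twig->parse($xml);    # use parsefile($path) for a file on disk
    return @lines;
}

my $xml = <<'XML';
<ArrayOfBooks>
  <Book><Title>The book of books</Title><Author>Sally</Author></Book>
  <Book><Title>The page of pages</Title><Author>Amanda</Author></Book>
</ArrayOfBooks>
XML

print "$_\n" for leaf_lines($xml);
```

        The emptied parent elements still accumulate, so on a truly huge file this uses more memory than a clean purge would. For real work you would also print inside the handler rather than collect the lines in an array.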