PT has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks.

I need to filter a potentially huge XML file. The simplified structure looks like this:

    <doc>
      <text>
        <p> text1 p1 </p>
        ...
      </text>
      <text>
        <p> text2 p1 </p>
      </text>
      ...
    </doc>

I need to get the text content of each <text> node, call a binary that processes the text and evaluates whether the current <text> node is to be removed.

My idea is to hold a single <text> node in memory, evaluate its text content, and either print it immediately or forget it and process the next <text> node.

I would like to use XML::LibXML::Reader, but I found no way to incrementally output the XML as it goes through the nodes. I know XML::Twig can flush, but I have encountered segfaults in the past while processing ~GB XML documents with it, which I was not able to debug. So I'd rather stay on the safe side with LibXML. Any ideas how to tackle this problem?

Thanks a bunch!

Replies are listed 'Best First'.
Re: Filtering large XML files
by choroba (Cardinal) on Feb 23, 2015 at 10:37 UTC
    To "extract" a node in XML::LibXML::Reader, use copyCurrentNode:
        if ('text' eq $r->name and XML_READER_TYPE_ELEMENT == $r->nodeType) {
            print $r->copyCurrentNode(1)->toString;
        }
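    A fuller sketch of the loop around it might look like the following; big.xml and keep_text() are placeholders for your input file and your external binary's verdict:

        use XML::LibXML::Reader;

        sub keep_text { 1 }    # placeholder: call your external binary here

        my $r = XML::LibXML::Reader->new(location => 'big.xml')
            or die "Cannot open big.xml\n";
        while ($r->read) {
            next unless 'text' eq $r->name
                    and XML_READER_TYPE_ELEMENT == $r->nodeType;
            my $node = $r->copyCurrentNode(1);    # 1 = deep copy, children included
            print $node->toString, "\n"
                if keep_text($node->textContent);
        }

    The reader also walks into each <text>'s children, but nothing inside matches the test, so every node is copied exactly once.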
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks, this might work. Do you have any idea how to handle the <doc> node? I cannot simply use copyCurrentNode on that one. When I do this:

          if ('doc' eq $r->name and XML_READER_TYPE_ELEMENT == $r->nodeType) {
              print $r->copyCurrentNode(0)->toString;
          }

      I get:

      <doc attr1=".." attr2=".."/>

      (an empty, self-closed <doc/>). Is there a clean way to output just the opening tag instead?
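
      For illustration, a naive workaround would be to re-serialize the start tag by hand from the reader's attribute cursor (a rough sketch; it does not escape attribute values or handle namespaces):

          print '<', $r->name;
          if ($r->moveToFirstAttribute) {
              do {
                  print ' ', $r->name, '="', $r->value, '"';
              } while ($r->moveToNextAttribute);
              $r->moveToElement;    # return the cursor to the element itself
          }
          print '>';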

Re: Filtering large XML files
by Discipulus (Canon) on Feb 23, 2015 at 11:10 UTC
    You had a segfault with something as easy as the following code?
    This just prints the text inside <text> tags. Note that flush prints its output and then clears it, while purge only clears it.

        use XML::Twig;
        my $t = XML::Twig->new(
            pretty_print  => 'indented',
            twig_handlers => {
                'text' => sub {
                    print $_[1]->text;
                    $_[0]->purge;
                    # or:
                    # my $useful  = $_[1]->text;
                    # my $ret_val = some_sub_that_process_it_further($useful);
                    # $_[0]->purge;
                },
            },
        );
        $t->parsefile('big.xml');    # your input file
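
    And to filter the document itself rather than only extract the text, a handler can decide per node and flush incrementally. A sketch, where keep_text() is a placeholder for your external binary:

        use XML::Twig;

        sub keep_text { 1 }    # placeholder for the external binary's verdict

        my $twig = XML::Twig->new(
            twig_handlers => {
                'text' => sub {
                    my ($t, $elt) = @_;
                    $elt->delete unless keep_text($elt->text);
                    $t->flush;    # print (and free) everything handled so far
                },
            },
        );
        $twig->parsefile('big.xml');
        $twig->flush;    # print whatever follows the last <text>

    The flush in the handler keeps memory bounded just like purge does, but also emits the document as it goes, so the surrounding <doc> tags come out for free.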

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      I don't remember exactly, but it was nothing much fancier than that. I might give Twig another try if LibXML doesn't suit my needs. Thanks!
Re: Filtering large XML files
by Yary (Pilgrim) on Feb 23, 2015 at 13:24 UTC
    Looking over XML::LibXML::Reader, and having worked with XML::LibXML recently, I see a mismatch between what you want and its design. Reader is great for reading XML nodes incrementally, but LibXML likes dealing with complete nodes and documents; I can't think of an easy, fool-proof way to make it write chunks of a partial document, which is what you'll be doing. (Another monk might yet figure it out!)

    As an alternative, SAX is good for stream-processing XML. It doesn't load the entire document into memory, so it's good for incremental work (and not so good for document processing where you want random access), and its filter-centered design is meant for precisely what you're doing.
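
    As a sketch of that filter-centered style (assuming flat, non-nested <text> elements, and with keep_text() a placeholder for your external binary): the filter buffers the events inside each <text> and, at </text>, either replays them downstream or drops them.

        package TextFilter;
        use base 'XML::SAX::Base';

        sub start_element {
            my ($self, $el) = @_;
            if ($el->{Name} eq 'text') {
                $self->{buf} = [];
                $self->{txt} = '';
            }
            if ($self->{buf}) { push @{ $self->{buf} }, [ start_element => $el ] }
            else              { $self->SUPER::start_element($el) }
        }

        sub characters {
            my ($self, $data) = @_;
            if ($self->{buf}) {
                push @{ $self->{buf} }, [ characters => $data ];
                $self->{txt} .= $data->{Data};
            }
            else { $self->SUPER::characters($data) }
        }

        sub end_element {
            my ($self, $el) = @_;
            return $self->SUPER::end_element($el) unless $self->{buf};
            push @{ $self->{buf} }, [ end_element => $el ];
            return unless $el->{Name} eq 'text';
            my $events = delete $self->{buf};
            return unless main::keep_text($self->{txt});
            for my $e (@$events) {
                my ($name, $data) = @$e;
                if    ($name eq 'start_element') { $self->SUPER::start_element($data) }
                elsif ($name eq 'characters')    { $self->SUPER::characters($data) }
                else                             { $self->SUPER::end_element($data) }
            }
        }

        package main;
        use XML::SAX::ParserFactory;
        use XML::SAX::Writer;

        sub keep_text { 1 }    # placeholder for the external binary's verdict

        my $writer = XML::SAX::Writer->new(Output => \*STDOUT);
        my $parser = XML::SAX::ParserFactory->parser(
            Handler => TextFilter->new(Handler => $writer),
        );
        $parser->parse_uri('big.xml');

    Everything outside <text> passes straight through to XML::SAX::Writer, so the surrounding document comes out as a well-formed stream.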

    Look at XML::SAX. I only looked at it briefly for a recent project; it might be helpful for you. It's a different way of thinking than LibXML, for sure!

    EDIT: I hadn't seen XML::Twig before, the perlish way of handling XML and in theory "as good as SAX". Good to learn something new! It seems good to use that if possible, and if it still segfaults, try contacting the author, who is always looking for more tests; see "Test Coverage" on its page.

      XML::LibXML::Reader is kind of a SAX for XML::LibXML. It allows you to process the XML stream, but you can ask it at any time to parse the current node and return the corresponding XML::LibXML object.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks Yary for suggesting XML::SAX! I'll look into it. Currently I'm testing XML::Twig. Seems to be working OK so far. Good luck with your projects!