PT has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks.

I need to filter a potentially huge XML file. The simplified structure looks like this:

    <doc>
      <text>
        <p> text1 p1 </p>
        ...
      </text>
      <text>
        <p> text2 p1 </p>
      </text>
      ...
    </doc>

I need to get the text content of each <text> node, call a binary that processes the text and evaluates whether the current <text> node is to be removed.

My idea is to hold a single <text> node in memory, evaluate its text content, and either print it immediately or forget it and process the next <text> node.

I would like to use XML::LibXML::Reader, but I found no way to incrementally output the XML as it goes through the nodes. I know XML::Twig can flush, but I have encountered segfaults in the past while processing ~GB XML documents with it, which I was not able to debug. So I'd rather stay on the safe side with LibXML. Any ideas how to tackle this problem?

Thanks a bunch!

Replies are listed 'Best First'.
Re: Filtering large XML files
by choroba (Cardinal) on Feb 23, 2015 at 10:37 UTC
    To "extract" a node in XML::LibXML::Reader, use copyCurrentNode:
        if ('text' eq $r->name and XML_READER_TYPE_ELEMENT == $r->nodeType) {
            print $r->copyCurrentNode(1)->toString;
        }
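    A fuller sketch of the loop around it might look like the following; big.xml and keep_text() are placeholders for your input file and your external binary's verdict:

        use XML::LibXML::Reader;

        sub keep_text { 1 }    # placeholder: call your external binary here

        my $r = XML::LibXML::Reader->new(location => 'big.xml')
            or die "Cannot open big.xml\n";
        while ($r->read) {
            next unless 'text' eq $r->name
                    and XML_READER_TYPE_ELEMENT == $r->nodeType;
            my $node = $r->copyCurrentNode(1);    # 1 = deep copy, children included
            print $node->toString, "\n"
                if keep_text($node->textContent);
        }

    The reader also walks into each <text>'s children, but nothing inside matches the test, so every node is copied exactly once.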
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thanks, this might work. Do you have any idea how to handle the <doc> node? I cannot simply use copyCurrentNode on that one. When I do this:

          if ('doc' eq $r->name and XML_READER_TYPE_ELEMENT == $r->nodeType) {
              print $r->copyCurrentNode(0)->toString;
          }

      I get:

      <doc attr1=".." attr2=".."/>

      (an empty, self-closed <doc/>). Is there a clean way to output just the opening tag instead?
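
      For illustration, a naive workaround would be to re-serialize the start tag by hand from the reader's attribute cursor (a rough sketch; it does not escape attribute values or handle namespaces):

          print '<', $r->name;
          if ($r->moveToFirstAttribute) {
              do {
                  print ' ', $r->name, '="', $r->value, '"';
              } while ($r->moveToNextAttribute);
              $r->moveToElement;    # return the cursor to the element itself
          }
          print '>';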

Re: Filtering large XML files
by Discipulus (Canon) on Feb 23, 2015 at 11:10 UTC
    You had a segfault with something as easy as the following code?
    This just prints the text inside <text> tags. Note that flush prints its output and then clears it, while purge only clears it.

        use XML::Twig;
        my $t = XML::Twig->new(
            pretty_print  => 'indented',
            twig_handlers => {
                'text' => sub {
                    print $_[1]->text;
                    $_[0]->purge;
                    # or:
                    # my $useful  = $_[1]->text;
                    # my $ret_val = some_sub_that_process_it_further($useful);
                    # $_[0]->purge;
                },
            },
        );
        $t->parsefile('big.xml');    # your input file
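
    And to filter the document itself rather than only extract the text, a handler can decide per node and flush incrementally. A sketch, where keep_text() is a placeholder for your external binary:

        use XML::Twig;

        sub keep_text { 1 }    # placeholder for the external binary's verdict

        my $twig = XML::Twig->new(
            twig_handlers => {
                'text' => sub {
                    my ($t, $elt) = @_;
                    $elt->delete unless keep_text($elt->text);
                    $t->flush;    # print (and free) everything handled so far
                },
            },
        );
        $twig->parsefile('big.xml');
        $twig->flush;    # print whatever follows the last <text>

    The flush in the handler keeps memory bounded just like purge does, but also emits the document as it goes, so the surrounding <doc> tags come out for free.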

    HtH
    L*
    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      I don't remember exactly, but it was nothing much fancier than that. I might give Twig another try if LibXML doesn't suit my needs. Thanks!
Re: Filtering large XML files
by Yary (Pilgrim) on Feb 23, 2015 at 13:24 UTC
    Looking over XML::LibXML::Reader, and having worked with XML::LibXML recently, I see a mismatch between what you want and its design. Reader is great for reading XML nodes incrementally, but LibXML likes dealing with complete nodes and documents; I can't think of an easy, fool-proof way to make it write chunks of a partial document, which is what you'll be doing. (Another monk might yet figure it out!)

    As an alternative, SAX is good for stream-processing XML. It doesn't load the entire document into memory, so it's good for incremental work (and not so good for document processing where you want random access), and its filter-centered design is meant for precisely what you're doing.
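
    As a sketch of that filter-centered style (assuming flat, non-nested <text> elements, and with keep_text() a placeholder for your external binary): the filter buffers the events inside each <text> and, at </text>, either replays them downstream or drops them.

        package TextFilter;
        use base 'XML::SAX::Base';

        sub start_element {
            my ($self, $el) = @_;
            if ($el->{Name} eq 'text') {
                $self->{buf} = [];
                $self->{txt} = '';
            }
            if ($self->{buf}) { push @{ $self->{buf} }, [ start_element => $el ] }
            else              { $self->SUPER::start_element($el) }
        }

        sub characters {
            my ($self, $data) = @_;
            if ($self->{buf}) {
                push @{ $self->{buf} }, [ characters => $data ];
                $self->{txt} .= $data->{Data};
            }
            else { $self->SUPER::characters($data) }
        }

        sub end_element {
            my ($self, $el) = @_;
            return $self->SUPER::end_element($el) unless $self->{buf};
            push @{ $self->{buf} }, [ end_element => $el ];
            return unless $el->{Name} eq 'text';
            my $events = delete $self->{buf};
            return unless main::keep_text($self->{txt});
            for my $e (@$events) {
                my ($name, $data) = @$e;
                if    ($name eq 'start_element') { $self->SUPER::start_element($data) }
                elsif ($name eq 'characters')    { $self->SUPER::characters($data) }
                else                             { $self->SUPER::end_element($data) }
            }
        }

        package main;
        use XML::SAX::ParserFactory;
        use XML::SAX::Writer;

        sub keep_text { 1 }    # placeholder for the external binary's verdict

        my $writer = XML::SAX::Writer->new(Output => \*STDOUT);
        my $parser = XML::SAX::ParserFactory->parser(
            Handler => TextFilter->new(Handler => $writer),
        );
        $parser->parse_uri('big.xml');

    Everything outside <text> passes straight through to XML::SAX::Writer, so the surrounding document comes out as a well-formed stream.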

    Look at XML::SAX. I only looked at it briefly for a recent project; it might be helpful for you. It's a different way of thinking than LibXML, for sure!

    EDIT: I hadn't seen XML::Twig before, the perlish way of handling XML and in theory "as good as SAX". Good to learn something new! It seems good to use that if possible, and if it still segfaults, try contacting the author, who is always looking for more tests; see "Test Coverage" on its page.

      XML::LibXML::Reader is kind of a SAX for XML::LibXML. It allows you to process the XML stream, but you can ask it at any time to parse the current node and return the corresponding XML::LibXML object.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Thanks Yary for suggesting XML::SAX! I'll look into it. Currently I'm testing XML::Twig. Seems to be working OK so far. Good luck with your projects!