PT has asked for the wisdom of the Perl Monks concerning the following question:
Hello Perl Monks.
I need to filter a potentially huge XML file. The simplified structure looks like this:
    <doc>
      <text>
        <p> text1 p1 </p>
        ...
      </text>
      <text>
        <p> text2 p1 </p>
      </text>
      ...
    </doc>
I need to extract the text content of each <text> node, pass it to an external binary, and use the binary's verdict to decide whether that <text> node should be removed.
My idea is to hold only a single <text> node in memory at a time, evaluate its text content, and either print the node immediately or discard it and move on to the next <text> node.
I would like to use XML::LibXML::Reader, but I have found no way to incrementally write the XML back out as the reader moves through the nodes. I know XML::Twig can flush, but in the past it segfaulted on me while processing gigabyte-sized XML documents, and I was never able to debug that, so I would rather stay on the safe side with XML::LibXML. Any ideas how to tackle this problem?
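
The closest I have come is to print the <doc> wrapper by hand and serialize each kept node myself. Below is a rough, untested sketch of what I mean; huge.xml, the evaluate-text binary and the keep_text wrapper are made-up placeholders, and the loop relies on <text> elements never nesting:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML::Reader;

    # Placeholder for the external check: pipe the text to the binary
    # and treat exit status 0 as "keep". 'evaluate-text' is a made-up name.
    sub keep_text {
        my ($text) = @_;
        open my $pipe, '|-', './evaluate-text'
            or die "Cannot run evaluate-text: $!";
        print {$pipe} $text;
        close $pipe;                  # sets $? from the child's exit status
        return 0 == $?;
    }

    my $reader = XML::LibXML::Reader->new(location => 'huge.xml')
        or die "Cannot open huge.xml\n";

    print qq{<?xml version="1.0"?>\n<doc>\n};
    # nextElement walks forward through the children of the node we just
    # copied, so this only works because <text> elements do not nest.
    while ($reader->nextElement('text')) {
        my $node = $reader->copyCurrentNode(1);   # deep copy of one <text> subtree
        print $node->toString, "\n"
            if keep_text($node->textContent);
    }
    print "</doc>\n";

Hand-printing the XML declaration and the <doc> tags feels fragile, though, so I would be happy to learn about a cleaner way to stream the output.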
Thanks a bunch!