tmaly has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to come up with a more efficient way to parse a large (2 GB), read-only XML file that is gzipped and does not have a root node. Right now I use IO::Pipe like this:
$fh = IO::Pipe->new; $fh->reader("echo '<root>'; gunzip -c somefile.xml.gz; echo '</root>'");
Then I pass $fh to XML::Parser::Expat. What I would like to do is use IO::Zlib and pass that into XML::Parser::Expat instead. However, I am not sure how to inject a root element into the IO::Zlib stream. Has anyone ever done something like this before?

Best Regards

Ty

Replies are listed 'Best First'.
Re: XML::Parser::Expat and non conforming XML
by jbert (Priest) on Nov 07, 2006 at 17:28 UTC
    You could create your own IO::Handle, which on the first read returns <root>, subsequently proxies reads to your IO::Zlib and then returns the closing tag when it hits EOF.
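    A minimal sketch of such a wrapper, under some assumptions: the class name RootWrapHandle and its constructor arguments are made up for illustration, and the demo reads from an in-memory handle rather than a real IO::Zlib object so that it runs with core modules only. Anything with a compatible read() method should be able to stand in for the inner handle.

```perl
package RootWrapHandle;    # hypothetical name, not an existing CPAN module
use strict;
use warnings;
use IO::Handle;
our @ISA = ('IO::Handle');

sub new {
    my ($class, $inner, $root) = @_;
    my $self = $class->SUPER::new();     # a blessed glob reference
    ${*$self}{inner}   = $inner;         # handle we proxy reads to
    ${*$self}{pending} = "<$root>";      # bytes to hand out before proxying
    ${*$self}{tail}    = "</$root>";     # queued once the inner handle hits EOF
    return $self;
}

sub read {
    my $self = shift;                    # $_[0] is now the caller's buffer
    my $len  = $_[1];
    my $st   = \%{*$self};

    # Opening tag already drained and closing tag not yet queued:
    # proxy the read to the inner handle.
    if ($st->{pending} eq '' && defined $st->{tail}) {
        my $n = $st->{inner}->read($_[0], $len);
        return $n if !defined $n || $n > 0;      # data, or a read error
        $st->{pending} = delete $st->{tail};     # EOF: queue the closing tag
    }

    # Serve the opening or closing tag; an empty result signals EOF.
    $_[0] = substr($st->{pending}, 0, $len, '');
    return length $_[0];
}

package main;

# Demo with an in-memory handle and a deliberately tiny read size.
my $xml = "<a>1</a><b>2</b>";
open my $in, '<', \$xml or die "cannot open in-memory handle: $!";
my $fh = RootWrapHandle->new($in, 'root');

my ($out, $buf) = ('');
$out .= $buf while $fh->read($buf, 5);
print "$out\n";    # prints <root><a>1</a><b>2</b></root>
```

    Whether XML::Parser::Expat's parse() accepts an object like this depends on how it reads from the handle it is given, so treat that part as something to verify against your installed version.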
Re: XML::Parser::Expat and non conforming XML
by nicholasrperez (Monk) on Nov 09, 2006 at 04:35 UTC
    It would probably make more sense to parse the file as a stream instead of trying to swallow the beast whole. I recommend setting up a SAX handler, feeding X::P::E a root tag, and then feeding it lines from the gzip'd file.

    By "feeding it", I mean using XML::Parser::ExpatNB and its parse_more() method. In the SAX handler, you can then build up your own data structure based on "depth" within the document, i.e. keeping data only at the levels you actually care about (for example, particular child nodes inside particular "top level" nodes) instead of filling up your RAM with a giant DOM.
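    A rough sketch of that approach, assuming XML::Parser is installed. The sample data, the "rec" element name and the "id" attribute are invented for the demo, and the file is read in fixed-size chunks rather than lines. The parser is fed an opening tag, then the decompressed chunks, then the closing tag, while the handlers track depth themselves.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::Zlib;
use XML::Parser;

# Build a small gzipped, rootless sample so the sketch runs as-is.
my ($tmp, $path) = tempfile(SUFFIX => '.xml.gz', UNLINK => 1);
close $tmp;
my $w = IO::Zlib->new($path, 'wb') or die "cannot write $path";
$w->print(qq{<rec id="1"><x>a</x></rec>\n<rec id="2"><x>b</x></rec>\n});
$w->close;

my $depth = 0;
my @ids;
my $parser = XML::Parser->new(Handlers => {
    Start => sub {
        my ($expat, $name, %attr) = @_;
        # Depth 0 is the injected <root>; depth 1 is a "top level" node.
        push @ids, $attr{id} if $depth == 1 && $name eq 'rec';
        $depth++;
    },
    End => sub { $depth-- },
});

# parse_start returns an XML::Parser::ExpatNB object.
my $nb = $parser->parse_start;

my $gz = IO::Zlib->new($path, 'rb') or die "cannot read $path";
$nb->parse_more('<root>');                    # injected opening tag
my $buf;
while ((my $n = $gz->read($buf, 65536)) > 0) {
    $nb->parse_more($buf);                    # feed decompressed chunks
}
$gz->close;
$nb->parse_more('</root>');                   # injected closing tag
$nb->parse_done;

print "top-level ids: @ids\n";
```

    Only the ids of the top-level records are kept here; a real handler would collect whatever it needs at the depths it cares about and discard the rest, which is what keeps memory flat on a 2 GB input.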

    What you are essentially describing is a Jabber IM session (an immense XML document) and this is a solved problem. Not to pimp my own code, but you could take a look at POE::Filter::XML for how this whole feed-the-parser thing can be implemented (with the usual caveats: YMMV, HTH, etc).