'ello folks,

I'm currently handling some XML documents that are too large to process in memory or to store on disk permanently. However, they fit well enough when gzip'ed (at about 15:1 compression). When it comes time to process them, I would like to avoid decompressing them fully first.

I thought I could solve the problem using IO::Zlib, which provides an interface much like IO::Handle; this would allow me to keep only portions of the decompressed text in memory at a time. And of course, XML::Twig is great for managing big XML documents (thanks mirod!), but it doesn't natively handle gzip'ed XML. Since XML::Parser::Expat (and, by extension, XML::Twig) can take an IO::Handle as a document source, I thought I could string the two together.

However, IO::Zlib doesn't actually inherit from IO::Handle, and XML::Parser::Expat demands that UNIVERSAL::isa($arg, 'IO::Handle') be true before it will treat the argument as a handle. I figured a simple workaround like this would work:

    package IO::Handle::Zlib;
    use IO::Zlib;    # load both parent classes before naming them in @ISA
    use IO::Handle;
    use vars qw/ @ISA /;
    @ISA = qw/ IO::Zlib IO::Handle /;

which would allow me to replace my IO::Zlib objects with IO::Handle::Zlib's transparently. However, when I try this out, I come across the following error, courtesy of expat:

    not well-formed (invalid token) at line 7213, column 3, byte 780490 at /path/to/perl/lib/5.6.1/IP27-irix/XML/Parser.pm line 185

Now that's odd, since the decompressed file ends at line 7212, and is only 780487 bytes long. One might think the file is being decompressed past the original size of the document, but inserting print DUMP <$gz>; gives a file that is identical to the original (i.e., the angle-bracket read gives a file that is also 7212 lines and 780487 bytes long). So clearly, whatever the XS part of XML::Parser::Expat is doing with the IO::Handle is not what the angle-brackets are doing. And expat itself is working, since replacing

    my $reader = new IO::Handle::Zlib;
    $reader->open( $compressed_filename, "rb" )
        or croak "could not open $compressed_filename: $!";

with

    my $reader = new IO::File;
    $reader->open( $uncompressed_filename, "r" );

eliminates the error.
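
One possible way to sidestep the handle check entirely, sketched here under assumptions rather than as a confirmed fix, is to pull decompressed chunks out of IO::Zlib by hand and pass them to whatever consumes the XML. The helper name `feed_gzip_chunks`, its callback argument, and the 32 KB default chunk size are all hypothetical and only serve the illustration; the sketch relies solely on IO::Zlib's documented `new`, `read`, and `close` methods.

```perl
use strict;
use warnings;
use IO::Zlib;

# Hypothetical helper (not part of IO::Zlib or XML::Parser): read a gzip'ed
# file in bounded chunks of decompressed text and hand each chunk to $sink.
# Only one chunk is held in memory at a time.
sub feed_gzip_chunks {
    my ( $path, $sink, $chunk_size ) = @_;
    $chunk_size ||= 32 * 1024;    # assumed default, tune as needed

    my $gz = IO::Zlib->new( $path, "rb" )
        or die "could not open $path: $!";

    my $total = 0;
    while (1) {
        my $buf = '';
        my $n = $gz->read( $buf, $chunk_size );

        # Stop on end of file (0) or on a read error (undef or negative).
        last unless defined $n && $n > 0;

        # Pass along exactly the bytes reported as read.
        $sink->( substr( $buf, 0, $n ) );
        $total += $n;
    }
    $gz->close;

    return $total;    # number of decompressed bytes delivered
}
```

With XML::Parser, the callback could plausibly forward each chunk to the non-blocking interface: `my $nb = $parser->parse_start;` followed by `feed_gzip_chunks( $file, sub { $nb->parse_more( $_[0] ) } );` and finally `$nb->parse_done;`. Whether XML::Twig's handlers behave the same way when driven through that interface is an open assumption and would need testing.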

Has anyone used IO::Zlib like this before? Is my IO::Handle::Zlib wrapper bogus? Anybody know how XS modules do IO::Handle reads, and why this doesn't work?

Thanks,

--athomason


In reply to IO::Zlib with XML::Twig/XML::Parser::Expat by athomason
