Okay, for a start I'm not mentioning the 2GB file size issue; that's been covered well enough already. I'm just touching on XML::Twig itself.

Looking at the docs for XML::Twig, it looks like it can handle very large XML files by not reading them into memory in one go. Unfortunately I don't think your code does this: you don't set up any handlers, so it tries to load the entire XML tree into memory. Boom, that'd need 20GB of memory.

Reread the docs on XML::Twig, and look at the bit on "Processing an XML document chunk by chunk". You need to guarantee you don't have too much in memory at any one time. I hope this is a document built up of lots of small chunks, or you're in for an even larger challenge.
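To make the chunk-by-chunk idea concrete, here's a minimal sketch. I'm assuming a made-up layout where the file is a long run of smallish <record> elements, each with an id attribute and a <name> child; those names and the count_records sub are purely illustrative, so swap in whatever your document really uses. The point is that the handler fires once per record and then purges, so the whole tree is never held in memory.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical layout (an assumption, not taken from the original file):
#   <records><record id="1"><name>foo</name></record> ... </records>
sub count_records {
    my ($file) = @_;
    my $count = 0;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called each time a complete <record> element has been parsed
            record => sub {
                my ($t, $record) = @_;
                $count++;
                my $id   = $record->att('id') // '(no id)';
                my $name = $record->first_child_text('name');
                print "$id: $name\n";
                # Throw away everything parsed so far to keep memory flat
                $t->purge;
            },
        },
    );

    $twig->parsefile($file);
    return $count;
}

# Only runs when a file name is given on the command line
if (@ARGV) {
    printf "Processed %d records\n", count_records($ARGV[0]);
}
```

If you need to write the processed chunks back out rather than discard them, the docs describe flush as the counterpart to purge.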

I'll admit that personally I'd be using a full SAX parser at this point in any case. From my cursory look at what XML::Twig does, it doesn't look much simpler than doing it that way. It's all just handlers and callbacks at the end of the day.

As for which SAX parser I'd use, I really don't know. I'd normally use XML::LibXML, but I'm not sure how that'll work on Windows, so I can't comment there.
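For comparison, this is roughly what the "handlers and callbacks" approach looks like with SAX. It's only a sketch: I'm assuming the generic XML::SAX interface (which picks whatever SAX driver is installed) and the same made-up <record> element as an example, and the RecordCounter package and sax_count_records sub are names invented for illustration. Nothing is kept around between events except the counter.

```perl
use strict;
use warnings;

# Hypothetical handler class: counts <record> elements as they stream past
package RecordCounter;
use base 'XML::SAX::Base';

sub start_document {
    my ($self) = @_;
    $self->{count} = 0;
}

sub start_element {
    my ($self, $el) = @_;
    # $el->{Name} is the element name as it appears in the document
    $self->{count}++ if $el->{Name} eq 'record';
}

package main;
use XML::SAX::ParserFactory;

sub sax_count_records {
    my ($file) = @_;
    my $handler = RecordCounter->new;
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
    $parser->parse_uri($file);    # streams the file, firing callbacks as it goes
    return $handler->{count};
}

# Only runs when a file name is given on the command line
if (@ARGV) {
    printf "Saw %d records\n", sax_count_records($ARGV[0]);
}
```

Real processing would also need characters and end_element callbacks to collect text, which is where the bookkeeping starts to add up compared with XML::Twig handing you a finished element.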


In reply to Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by Molt
in thread Memory errors while processing 2GB XML file with XML:Twig on Windows 2000 by nan
