I have parsed huge XML files of up to 2GB, and I can tell you it takes time, especially if you take an XML-ish approach, i.e. use an XML parser. Of course any DOM approach is out of the question: loading such a document into memory is asking for trouble.
It depends of course on the complexity of the XML file, the hardware configuration, etc., but if you parse huge XML files you basically have to be patient. Don’t expect miracles from other modules; there is no silver bullet for parsing huge XML files.
In my situation, parsing the file and updating some nodes took over 1 hour for a 1GB file with XML::Twig. That was not fast enough for my requirements.
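To make the streaming-versus-DOM point concrete, here is a minimal sketch in Python rather than Perl. It is not the XML::Twig code from the run above; the function name `count_records` and the sample document are made up for illustration. It handles each element as soon as it is complete and then discards it, so memory use should stay roughly flat however large the input is.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag):
    """Count <tag> elements in `source` without keeping the whole tree in memory."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1    # real processing of the finished record would go here
            root.clear()  # drop everything attached to the root so far
    return count

# Hypothetical sample data; a real run would pass a file name or an open file.
sample = io.BytesIO(b"<items><item id='1'>a</item><item id='2'>b</item><item id='3'>c</item></items>")
print(count_records(sample, "item"))  # prints 3
```

In XML::Twig the same idea is expressed with twig handlers that call `purge` or `flush` once a record has been dealt with. Streaming keeps memory bounded, but every byte still goes through the parser, so it does not make the run fast by itself.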
Some ideas, most of which I explored in the past:
- Rethink the problem! Maybe it’s possible to decrease the size of the XML files. I mean, XML files of 3-4GB do sound a bit weird/large.
- Map the XML structure onto a relational database, do a bulk load, and let the DB do the work for you.
- Choose a non-XML approach: instead of parsing the file with a parser, you might opt for a handcrafted Perl solution.
- Take a look at other environments: I got some very good performance out of Xalan using SAX. There is a C++ version if you’re allergic to Java ;)
- There are native XML databases with good performance, but this most likely means spending money!
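The relational-database idea from the list can be sketched as follows, again in Python with only the standard library. Everything specific here is an assumption for illustration: the `item` table, the `<item id="..."><name>...</name></item>` record layout, the `load_items` helper, and SQLite standing in for whatever database you actually have. Records are streamed out of the XML and inserted in batches, after which the update is a single SQL statement.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

def load_items(source, conn, batch_size=1000):
    """Stream hypothetical <item id=".."><name>..</name></item> records into a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS item (id INTEGER PRIMARY KEY, name TEXT)")
    batch, total = [], 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        batch.append((int(elem.get("id")), elem.findtext("name")))
        # Frees the record's children; the emptied <item> shells stay attached
        # to the root, so a truly huge file would also need the root cleared.
        elem.clear()
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO item VALUES (?, ?)", batch)
            total += len(batch)
            batch.clear()
    if batch:  # insert the final partial batch
        conn.executemany("INSERT INTO item VALUES (?, ?)", batch)
        total += len(batch)
    conn.commit()
    return total

conn = sqlite3.connect(":memory:")
xml = io.BytesIO(b"<items><item id='1'><name>foo</name></item>"
                 b"<item id='2'><name>bar</name></item></items>")
load_items(xml, conn)
# The "updating some nodes" step becomes one statement run by the database.
conn.execute("UPDATE item SET name = upper(name) WHERE id = 2")
print(conn.execute("SELECT id, name FROM item ORDER BY id").fetchall())
# prints [(1, 'foo'), (2, 'BAR')]
```

Whether this beats a pure parsing approach depends on how well the XML maps onto tables and on the bulk-load tooling of the database in question. Most databases ship a dedicated bulk loader that would likely be faster than batched inserts.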
HTH,
dHarry