How big will the document be?
Do you need XPath capabilities?
I like XML::Twig.
| [reply] |
Hi,
the document could be as big as 1 GB. The fastest module I know of is XML::LibXML, but even that module doesn't seem quick enough.
Thomas.
| [reply] |
Honestly, I've only worked with XML::Twig so I cannot tell you how it compares to other modules. However, I recommend looking at this FAQ entry:
That said, if you need to process a document that is too big to fit in memory and XML::Twig is too slow for you, my reluctant advice would be to use "bare" XML::Parser. It won't be as easy to use as XML::Twig: with XML::Twig you basically trade some speed (from a factor of 3 to none, depending on what you do) for ease of use. Bare XML::Parser will still be easier IMHO than using SAX (albeit not standard), and at that point a LOT faster (see the last test in the simple benchmark).
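To give you an idea, here is a minimal sketch of the "bare" XML::Parser approach, assuming your records live in <record> elements (the element name and file name are only placeholders):

    use strict;
    use warnings;
    use XML::Parser;

    my ($in_record, $buffer, $count) = (0, '', 0);

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my (undef, $elt) = @_;
                if ($elt eq 'record') { $in_record = 1; $buffer = ''; }
            },
            Char  => sub {
                my (undef, $text) = @_;
                $buffer .= $text if $in_record;
            },
            End   => sub {
                my (undef, $elt) = @_;
                if ($elt eq 'record') {
                    $in_record = 0;
                    $count++;
                    # process $buffer here, then let it be overwritten
                }
            },
        },
    );

    $parser->parsefile('big.xml');   # streams the file, never builds a tree
    print "processed $count records\n";

You have to keep track of the state yourself (that's the price of the speed), but memory use stays flat no matter how big the file is.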
| [reply] |
1 GB is going to be slow no matter what. If you have enough memory to load the document entirely into memory, then XML::LibXML will be the fastest option. If you can't load it and you need to use SAX, then performance drops dramatically (as mentioned previously, look at this benchmark). I am not sure whether the pull interface is exposed by the Perl module, but that might give you XML::LibXML speed without having to load the whole document (I have never used it myself, so I am not sure it's even possible). XML::Twig would be a better option than using SAX. And indeed, XML::Parser might be your best bet, even if it's not the easiest module to use.
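If the pull interface is available in your XML::LibXML build, it is exposed as XML::LibXML::Reader, and a sketch could look like this (untested here, and <record> is just a stand-in for whatever your record element is):

    use strict;
    use warnings;
    use XML::LibXML::Reader;

    my $reader = XML::LibXML::Reader->new(location => 'big.xml')
        or die "cannot open big.xml\n";

    my $count = 0;
    while ($reader->nextElement('record')) {
        # copyCurrentNode(1) builds a DOM fragment for this record only,
        # so the full document is never held in memory at once
        my $record = $reader->copyCurrentNode(1);
        $count++;
        # ... process $record with the usual XML::LibXML node methods ...
    }
    print "saw $count records\n";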
Beyond that, you get into dangerous territory: format the XML properly, get rid of potential comments, CDATA sections, PIs and entities, and use regexps. I would try the other options first, though.
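For completeness, the regexp route can be as crude as setting the input record separator to the closing tag. This only works if you really have cleaned out the comments, CDATA sections and anything else that could contain a stray closing tag (again, the tag name and file name are made up):

    use strict;
    use warnings;

    open my $fh, '<', 'big.xml' or die "cannot open big.xml: $!";
    local $/ = '</record>';          # read one record-sized chunk at a time

    my $count = 0;
    while (my $chunk = <$fh>) {
        # grab the body of the record in this chunk, skip anything else
        my ($body) = $chunk =~ m{<record\b[^>]*>(.*)</record>\z}s
            or next;
        $count++;
        # ... pull the fields out of $body with further regexps ...
    }
    close $fh;
    print "matched $count records\n";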
| [reply] |
Your question could be more specific to allow a good answer. What does the structure of your XML document look like, and how do you define a "record" in the XML context? The complexity of the document plays a role. Furthermore, can you quantify "high-performance"? Is it 100 MB per minute?!
I have parsed 1 GB XML documents, and in my experience it takes time to do so. In a Perl context I have mainly used XML::Twig. I have parsed large, simple XML documents (see Putting XML::Twig to the test for an example). You can also save approximately 30% by optimizing XML::Twig. I am still working on improving my solution, i.e. doing a proof of concept with it.
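For reference, the usual XML::Twig idiom for files that size is twig_roots plus purge, something along these lines (the <record> element name is only an example):

    use strict;
    use warnings;
    use XML::Twig;

    my $count = 0;
    my $twig  = XML::Twig->new(
        # build only the <record> subtrees; the rest of the document is skipped
        twig_roots => { record => \&handle_record },
    );
    $twig->parsefile('big.xml');
    print "processed $count records\n";

    sub handle_record {
        my ($twig, $record) = @_;
        $count++;
        # ... read what you need, e.g. $record->first_child_text('id') ...
        $twig->purge;    # free the record we just handled
    }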
I am not a fan of reading large documents into memory. In my experience it doesn’t speed up the parsing at all. Parsing a document typically generates a lot of method/function calls (the bottleneck) whether it resides in memory or not.
Let me know what solution you end up with. I have a special interest in parsing large XML documents myself.
Cheers
dHarry
| [reply] |
It very much depends on whether you need access to all records (random access) after parsing, or whether you just need to iterate from the first record to the last, like a stream. In the second case, any event-producing parser should satisfy you - I would use XML::Parser::Expat, because I am familiar with it.
| [reply] |