Marsel has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have to parse an XML file which is ... 1.8 GB large! My first try was with a simple XML::Parser::Wrapper, and of course it worked perfectly on an example file, but ended with 'out of memory' on the real one!!

I then found on the net an article by Kip Hampton from 2001 (on XML.com) which starts with: "The problem: The XML documents you have to parse are getting too large to load the entire document tree into memory; performance is suffering. The solution: use SAX."
So I dropped into XML::SAX, and after a while understood (I hope) how to code my own handler package with the usual 'start_element', 'end_element', 'characters', ... methods to do the job.

Again, everything goes well with the test file, but when I launch the script on the big one, memory usage climbs to 100%, then swap fills up, and then ... I killed it.

Do you think this could come from a bad way of coding, or is the module inappropriate for such a big file? And if so, do you know a module I could use to read the file as a stream, without loading it all into memory?
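A minimal sketch of the kind of handler package described above (not the poster's actual code; the 'record' element name is a placeholder, and it assumes the XML::SAX distribution is installed):

```perl
package CountingHandler;
use strict;
use warnings;
use base qw(XML::SAX::Base);

# Keep only a tiny amount of state: a counter and the text of the
# element currently being parsed -- never the whole document.
sub start_element {
    my ($self, $el) = @_;
    $self->{text} = '' if $el->{Name} eq 'record';
}

sub characters {
    my ($self, $data) = @_;
    $self->{text} .= $data->{Data} if defined $self->{text};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'record') {
        $self->{count}++;
        # ... process $self->{text} here, then discard it
        delete $self->{text};
    }
}

package main;
use XML::SAX::ParserFactory;

my $handler = CountingHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<r><record>a</record><record>b</record></r>');
print "saw $handler->{count} records\n";
```

Because each record's text is deleted as soon as it has been handled, memory use stays flat no matter how large the input is.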
Thanks

julien

Replies are listed 'Best First'.
Re: How to Parse Huge XML Files ?
by marto (Cardinal) on May 31, 2006 at 17:17 UTC
    hi Marsel,

    Take a look at XML::Twig "A perl module for processing huge XML documents in tree mode."

    Update: You may also want to take a look at xmltwig.com; it has some example programs (as well as stacks of other information) which may be of interest to you.

    Hope this helps.

    Martin
Re: How to Parse Huge XML Files ?
by samtregar (Abbot) on May 31, 2006 at 17:54 UTC
    You must be doing something wrong - XML::SAX is designed to process huge XML files. Show us your code and we'll show you the problem. Most likely you're building up a huge data structure in your handler methods, but that's just one possibility.

    Also, what XML::SAX parser are you using? The default (XML::SAX::PurePerl) has terrible performance and I wouldn't be at all surprised to find that it leaks memory. I suggest you try XML::SAX::ExpatXS, which is fast and has the best support for the complete SAX interface.
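    A small illustration of how the backend is chosen through XML::SAX::ParserFactory (assumes the XML::SAX distribution is installed; XML::SAX::ExpatXS is only an option if it is installed too):

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# Show which SAX parsers are registered with XML::SAX; the factory
# picks the most recently registered one unless told otherwise.
for my $p (@{ XML::SAX->parsers }) {
    print "registered: $p->{Name}\n";
}

# To force a specific backend (it must be installed), set this
# before asking the factory for a parser:
# $XML::SAX::ParserPackage = 'XML::SAX::ExpatXS';

my $parser = XML::SAX::ParserFactory->parser;
print "using: ", ref($parser), "\n";
```

    If nothing else is registered, the factory falls back to the slow XML::SAX::PurePerl default, which is worth checking before blaming your own code.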

    -sam

Re: How to Parse Huge XML Files ?
by davido (Cardinal) on May 31, 2006 at 17:55 UTC

    XML::Twig is an excellent alternative. I would go so far as to recommend it even for XML files that are known to be smaller in size too. It has a "simple" mode which looks and feels similar to XML::Simple, yet the XML::Twig module is more robust, handles a wider range of XML, and can accommodate enormous XML files.
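    A minimal XML::Twig sketch of that approach (the 'record' element name is a placeholder; XML::Twig must be installed):

```perl
use strict;
use warnings;
use XML::Twig;

my $count = 0;
my $twig  = XML::Twig->new(
    twig_handlers => {
        # called each time a complete <record> element has been parsed
        record => sub {
            my ($t, $elt) = @_;
            $count++;
            # ... process $elt->text here ...
            $t->purge;    # then release everything parsed so far
        },
    },
);
$twig->parse('<r><record>a</record><record>b</record></r>');
print "saw $count records\n";
```

    The call to purge is what keeps memory bounded on huge files: each twig is discarded once its handler has run, so only the current twig is ever held in memory.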


    Dave

Re: How to Parse Huge XML Files ?
by jsegal (Friar) on May 31, 2006 at 17:46 UTC
    Without seeing your code, it is impossible to know precisely what is going on, but don't forget that it also depends on what else you are doing and how you are processing the file. For example, if you are building an in-memory data structure based on the file contents, you can run out of memory even when processing the file as SAX events!

    Are you processing the file/events sequentially, or building up some other structure in memory? Obviously, with a large file, you are better off if you keep only a small amount of "processing data" in memory, too.

    All the best,

    --JAS
      Thanks for all these answers. Here is my code. And you were right! I forgot to undef the hash structure that holds the data!! But it still doesn't work: when I launch it, it fills my memory and swap (6 GB in total).
      For example, it doesn't even print the first status message, "Here we go ..............", which is printed in response to the start_document event.
      The main code is here. Thanks for the advice; I'll have a look at XML::Twig too. Yours sincerely, Julien


        Hmm. If your initial status message isn't getting printed, I'd double-check that you are running what you think you are running. (I find the debugger invaluable in cases like this -- I happen to like running it from within (x)emacs.) Sometimes a module doesn't do what you think it is going to do, and sometimes you aren't even running the code you think you are running!

        I know I've been burned by editing a file in one directory, but actually running a version in another directory -- when putting in debugging print statements, I've learned to vary what I output, so I instantly have a positive control that I am running the version of the file I should be -- if the output is "foo" but I just added "baz", I instantly know something is amiss, and don't try to debug the wrong thing...

        All that being said, this may not be your problem, but it might give you some clues as to what is going on....

        Good luck,


        --JAS
Re: How to Parse Huge XML Files ?
by ambrus (Abbot) on Jun 01, 2006 at 13:28 UTC
Re: How to Parse Huge XML Files ?
by toma (Vicar) on Jun 02, 2006 at 23:58 UTC
    You might want to take a look at my article that compares SAX and Twig. In my large-file test, SAX and Twig were close to each other in performance, at least in terms of speed. The testing included improvements to the code suggested by matts, mirod, and barries. My code ended up being a lot faster when I was done than when I started!

    It should work perfectly the first time! - toma