Memory errors while processing 2GB XML file with XML:Twig on Windows 2000

by nan (Novice)
on May 15, 2005 at 18:40 UTC

nan has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

I'm trying to process a 2GB XML file using XML::Twig (which some people suggested), since XML::Simple can't handle such a huge file. My Perl code runs perfectly when I use a smaller sample file.

However, when I try to read this 2GB XML file, it always shows "The instruction at 0xaddress referenced memory at 0xaddress. The memory could not be written." after several minutes.

So far I'm not sure whether it is because the file is too big to load or because I used too many foreach loops to search for XML elements and their attributes.

Could anyone offer me some help?

Thanks,

Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by dbwiz (Curate) on May 15, 2005 at 19:08 UTC

    2GB XML files are beyond the limit of my direct experience with XML, but I can point you to a few places where you may find the right answer:

    • There is an article written by one of the creators of XML, dealing with this kind of problem.
    • Additionally, there was a discussion here on PerlMonks (is XML too hard?) with several interesting points.
    • Finally, you may try your hand at XML::TokeParser, a module that came out of the above-mentioned discussion.

    Good luck.

      Hi,

      Thank you so much for the help. I'll try XML::TokeParser later and let you know the result.

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by mirod (Canon) on May 15, 2005 at 20:07 UTC

    You don't give us much to work with. Some code and perhaps the amount of RAM on your system would help.

    If you do a straight XML::Twig->new->parsefile( 'my_big_fat_xml_file.xml');, then the resulting data structure should need somewhere around 20GB. That's why XML::Twig lets you process a file one chunk at a time, and purge the memory when you're done with each chunk.
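    For example, a minimal chunk-at-a-time skeleton might look like this (the Topic element name and the file name are just placeholders taken from this thread):

      use strict;
      use warnings;
      use XML::Twig;

      my $twig = XML::Twig->new(
          twig_handlers => {
              # called once per <Topic/> element, as soon as it has been fully parsed
              Topic => sub {
                  my ( $t, $elt ) = @_;
                  # ... do something with $elt here ...
                  $t->purge;    # discard what has been parsed so far, to keep memory flat
              },
          },
      );

      $twig->parsefile('my_big_fat_xml_file.xml');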

    The README for the module (at least for the latest version) includes links to lots of resources about the module. You could start by looking at xmltwig.com.

      Hi,

      If you don't mind, please refer to the code and XML segment in my replies to the other people. My RAM is 1GB and my Perl version is the latest. Initially I had a virtual memory error too, but that was solved after I raised the virtual memory to the maximum (4GB); now I only get the memory-write error. Do you think it would be better if I tried Linux?

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by Zaxo (Archbishop) on May 15, 2005 at 18:52 UTC

    Not much info, but the 2GB limit suggests that your perl or OS lacks large file support. With OS support, perl can be recompiled to provide that. Run perl -V on the command line to check.
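    For example, a quick check using the standard Config module (the uselargefiles entry is 'define' on a perl built with large file support):

      use Config;
      # 'define' here means this perl was compiled with large file support
      print "uselargefiles = $Config{uselargefiles}\n";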

    After Compline,
    Zaxo

      Hi Zaxo,

      Thank you for the advice. My Perl is v5.8.6 built for MSWin32-x86-multi-thread.

      My XML sample file is shown below:

      Basically, the XML file has two key parallel nodes: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, a corresponding <ExternalPage/> node exists to give more detailed information about the content of that <link/>, such as <d:Title/> and <d:Description/>.

      However, not every <Topic/> node has one or more <link/> children, so I need to write a loop to find out whether <link/> is a child of each <Topic/> node. If there are <link/> nodes, I check each <ExternalPage/> to output more information.

      My code is shown below; it is quite straightforward:

      Thanks again,
Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by tlm (Prior) on May 15, 2005 at 18:55 UTC

    Without seeing some code it is impossible to give any concrete advice. How have you taken advantage of XML::Twig to process only chunks of the XML tree? Have you considered using a purely event-driven parser like XML::Parser?

    the lowliest monk

      Hi,

      Thank you for the advice. Actually, as the file is huge, some nice people suggested I try XML::Twig as it is more efficient. My XML case is a little bit unusual; if you don't mind, please have a look at the details below:

      My XML sample file is shown below:

      Basically, the XML file has two key parallel nodes: <Topic/> and <ExternalPage/>. If a <Topic/> has a <link/> child, a corresponding <ExternalPage/> node exists to give more detailed information about the content of that <link/>, such as <d:Title/> and <d:Description/>.

      However, not every <Topic/> node has one or more <link/> children, so I need to write a loop to find out whether <link/> is a child of each <Topic/> node. If there are <link/> nodes, I check each <ExternalPage/> to output more information.

      My code is shown below; it is quite straightforward:

      Thanks again,

        It looks like you are using XML::Twig in "tree mode" as opposed to "stream mode", which is what I suspected. It means that your code tries to read the entire tree into memory, instead of processing it one chunk at a time, which is what stream mode is good for.

        As I understand it, there are two basic approaches to parsing a tree like this. You can first build a tree object that your program can then traverse up and down as it pleases, and manipulate like any other data structure.

        Alternatively, you can define handlers (aka callbacks) that the parser will invoke whenever it encounters a particular condition (e.g. when it finds a particular tag) as it parses the tree. The latter ("event-driven") approach has the advantage that the parser does not need to read the whole tree into memory; the parsing and whatever you want to do with the parsed text go hand in hand. The downside is that your program cannot backtrack to examine parts of the tree that have already been parsed.

        I'm not very familiar with XML::Twig, but it appears to be a bit of a hybrid of these two basic approaches, in that it lets you install handlers that are triggered by parsing events, but it also lets your program access subtrees of the partially parsed tree. These subtrees can be manipulated as a whole, and then purged from memory. This makes it possible to keep only a small part of the tree in memory, just like with any other event-driven parser such as XML::Parser, but to manipulate entire chunks of the tree (possibly the whole tree) as you could with a tree-building parser.

        Anyway, be that as it may, below is my attempt to re-cast your code in terms of XML::Twig handlers. See the docs for XML::Twig for more details. I could not understand the point of changing directories, so that part of the code may be messed up; I commented it out for the purpose of running the code. The program installs two handlers at the time of invoking the constructor for the parser, one for Topic elements and one for ExternalPage elements. The handlers communicate via a shared variable, the %links hash.
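        A sketch along those lines, with two handlers sharing a %links hash (the r:resource and about attribute names are guesses about the file's format, since the sample and the original listing are not reproduced here):

          use strict;
          use warnings;
          use XML::Twig;

          my %links;    # URLs referenced by <link/> children of <Topic/> elements

          my $twig = XML::Twig->new(
              twig_handlers => {
                  Topic => sub {
                      my ( $t, $topic ) = @_;
                      for my $link ( $topic->children('link') ) {
                          my $url = $link->att('r:resource');    # attribute name is a guess
                          $links{$url} = 1 if defined $url;
                      }
                      $t->purge;    # free everything parsed so far
                  },
                  ExternalPage => sub {
                      my ( $t, $page ) = @_;
                      my $url = $page->att('about');             # attribute name is a guess
                      if ( defined $url and $links{$url} ) {
                          print join( "\t",
                              $url,
                              $page->first_child_text('d:Title'),
                              $page->first_child_text('d:Description') ), "\n";
                      }
                      $t->purge;
                  },
              },
          );

          $twig->parsefile('my_big_fat_xml_file.xml');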

        Let me know how it goes.

        the lowliest monk

Re: Memory errors while processing 2GB XML file with XML:Twig on Windows 2000
by Molt (Chaplain) on May 16, 2005 at 15:25 UTC

    Okay, for a start I'm not mentioning the 2GB file size issue; that's been covered well enough already. I'm just touching on XML::Twig itself.

    Looking at the docs for XML::Twig, it looks like it is capable of handling very large XML files by not reading them into memory in one go. Unfortunately, I don't think your code does this: you don't set up the handlers, and hence it tries to load the entire XML tree into memory. Boom, that'd need 20GB of memory.

    Reread the docs for XML::Twig and look at the section on "Processing an XML document chunk by chunk". You need to guarantee you don't have too much in memory at any one time; I hope this is a document built up of lots of small chunks, or you're in for an even larger challenge.

    I'll admit that personally I'd be using a full SAX parser at this point in any case; from my cursory look at XML::Twig, it doesn't look much simpler than doing it that way. It's all just handlers and callbacks at the end of the day.

    As for which SAX parser I'd use, I really don't know. I'd normally use XML::LibXML, but I'm not sure how that'll work on Windows, so I can't comment there.
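    If you do go the SAX route, a minimal event-driven sketch using the generic XML::SAX interface might look like the following (XML::LibXML::SAX, or whatever SAX driver is installed, plugs in behind XML::SAX::ParserFactory; the Topic element name comes from the thread above, and the counting is only for illustration):

      use strict;
      use warnings;

      package TopicCounter;
      use base 'XML::SAX::Base';

      # start_element fires for every opening tag as the parser streams through
      # the file, so memory use stays flat no matter how big the document is
      sub start_element {
          my ( $self, $el ) = @_;
          $self->{topics}++ if $el->{Name} eq 'Topic';
      }

      sub end_document {
          my ($self) = @_;
          printf "Saw %d Topic elements\n", $self->{topics} || 0;
      }

      package main;
      use XML::SAX::ParserFactory;

      my $parser = XML::SAX::ParserFactory->parser( Handler => TopicCounter->new );
      $parser->parse_uri( $ARGV[0] );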
