kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to parse a huge XML file using XML::Simple and XML::LibXML. It works well for small files, but takes a very long time for files larger than 50MB.

Is there any way that I can parse huge XML files without disturbing my actual code written to extract the data?

Replies are listed 'Best First'.
Re: Parse XML of large size
by marto (Cardinal) on Dec 01, 2009 at 11:57 UTC

    You could look at using XML::Twig, which bills itself as being able to process huge XML files. Alternatively, you could profile your code with Devel::NYTProf to see if there are any points for improvement; see also Debugging and Optimization from the tutorials section of this site. This would of course require you to change some of your code; however, since you don't show us your code, we can't point out any issues it may have or suggest areas for improvement.
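    Profiling with Devel::NYTProf is typically a two-step process. A minimal sketch, assuming your script is called parse.pl (the file name is illustrative) and Devel::NYTProf is installed from CPAN:

    ```shell
    # Run the script under the profiler; this writes ./nytprof.out
    perl -d:NYTProf parse.pl

    # Turn the profile into browsable HTML reports in ./nytprof/
    nytprofhtml

    # Open nytprof/index.html to see per-line and per-sub timings
    ```

    The per-line report should make it obvious whether the time is going into the parse itself or into your extraction code.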

      Here is the code; it's pretty simple.

      I've dumped the XML file into a Perl data structure using XMLin(), and it's that call that takes most of the time.

      The rest of the processing to extract the data doesn't take much time.

      use strict;
      use warnings;
      use XML::Simple;
      use XML::LibXML;

      my $XML_FILE = 'sample.xml';

      # Dump of the XML file into Perl data structures
      my $mldata = XMLin($XML_FILE);

        BTW, use XML::LibXML is superfluous here, as you're not actually using it anywhere in the code...

Re: Parse XML of large size
by almut (Canon) on Dec 01, 2009 at 12:56 UTC

    In case you want to stick with XML::Simple (to avoid having to change existing code), you might want to check which parser XML::Simple is using under the hood  (in case of doubt, print out which modules have been loaded at the end of your script, using print join "\n", sort values %INC;)
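    A minimal self-contained illustration of that check, using only a core module as a stand-in for XML::Simple and friends:

    ```perl
    use strict;
    use warnings;

    # Load something, as your real script would load XML::Simple etc.
    use File::Spec;

    # %INC maps loaded module files to the paths they were loaded from;
    # printing its values at the end of the script shows exactly which
    # parser modules were actually pulled in under the hood.
    print join("\n", sort values %INC), "\n";
    ```

    In your real script, look for XML::Parser, XML::SAX::PurePerl, or XML::LibXML::SAX in that list — which one appears makes a large difference in speed.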

    As you're already using XML::LibXML anyway, you might want to try telling XML::Simple to use it by setting:

    $XML::Simple::PREFERRED_PARSER = "XML::LibXML::SAX";

    (See XML::Simple Environment for the details.)

    As XML::LibXML is known to be one of the fastest XML parsers, this might speed things up (even though XML::Simple would of course still be creating its gazillions of hashes and arrays...)
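    Putting it together, a sketch of the change (the file name is a placeholder, the rest of your extraction code stays the same; requires XML::LibXML and its SAX driver to be installed):

    ```perl
    use strict;
    use warnings;
    use XML::Simple;

    # Ask XML::Simple to use the libxml2-based SAX parser instead of
    # whatever (possibly pure-Perl, slow) parser it would pick by default.
    $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';

    my $mldata = XMLin('sample.xml');
    # ... existing extraction code, unchanged ...
    ```

    Note that this must be set before the call to XMLin(); the resulting data structure is the same, only the parsing underneath is faster.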

Re: Parse XML of large size
by Jenda (Abbot) on Dec 01, 2009 at 16:39 UTC

    XML::Twig may be a good candidate. If you use its simplify() method, the changes relative to the XML::Simple-based code may be fairly small: you just replace the outermost loop with a subroutine definition, and instead of parsing the whole file and then looping, you tell XML::Twig to call that subroutine for each of those tags. Something like (PSEUDOCODE!):

    # before (XML::Simple):
    my $xml = XMLin($file);
    for my $foo (@{ $xml->{foo} }) {
        # and now we process the $foo
    }

    # after (XML::Twig):
    my $twig = XML::Twig->new(
        twig_roots => { 'foo' => \&process_foo },
    );
    $twig->parsefile($file);

    sub process_foo {
        my ($twig, $foo_elt) = @_;
        my $foo = $foo_elt->simplify();
        # and now we process the $foo
        $twig->purge;    # free the memory used by this twig
    }
    (It's apparent that I hadn't used XML::Twig for years :-)

    Another option is to use XML::Rules. It can tweak and simplify the generated structure as the file is parsed and, like XML::Twig, it allows you to execute code once a "twig" (a tag with all subtags and content) is fully parsed. See some of the examples on perlmonks or included with the module.
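    A minimal XML::Rules sketch in the same spirit (the tag name 'foo' and the file name are placeholders; requires XML::Rules from CPAN):

    ```perl
    use strict;
    use warnings;
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => [
            # keep the text content of leaf tags
            _default => 'content',
            # called once each <foo>...</foo> is fully parsed
            foo => sub {
                my ($tag, $attrs) = @_;
                # process the already-simplified hash in $attrs here
                return;    # return nothing, so the tree is not kept in memory
            },
        ],
    );
    $parser->parsefile('sample.xml');
    ```

    Returning an empty list from the handler is what keeps the memory use flat: the processed twig is discarded instead of being accumulated into one big structure.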

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Parse XML of large size
by Anonymous Monk on Dec 01, 2009 at 11:54 UTC
    Is there any way that I can parse huge XML files without disturbing my actual code written to extract the data?

    No. Depending on what you actually wrote, it shouldn't take too long (< 30 min) to convert to XML::Twig