kalyanrajsista has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to parse a huge XML file using XML::Simple and XML::LibXML. It works well for small files, but takes a very long time for files larger than 50MB.

Is there any way that I can parse huge XML files without disturbing my actual code written to extract the data?

Replies are listed 'Best First'.
Re: Parse XML of large size
by marto (Cardinal) on Dec 01, 2009 at 11:57 UTC

    You could look at using XML::Twig, which bills itself as being able to process huge XML files. Alternatively, you could profile your code with Devel::NYTProf to see if there are any points for improvement; see also Debugging and Optimization from the tutorials section of this site. This would of course require you to change some of your code; however, since you don't show us your code, we can't point out any issues it may have or suggest areas for improvement.
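    Profiling with Devel::NYTProf is typically a two-step process. A minimal sketch, assuming your script is called parse.pl (the file name is illustrative) and Devel::NYTProf is installed from CPAN:

    ```shell
    # Run the script under the profiler; this writes ./nytprof.out
    perl -d:NYTProf parse.pl

    # Turn the profile into browsable HTML reports in ./nytprof/
    nytprofhtml

    # Open nytprof/index.html to see per-line and per-sub timings
    ```

    The per-line report should make it obvious whether the time is going into the parse itself or into your extraction code.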

      Here is the code; it's pretty simple.

      I've dumped the XML file into a Perl data structure using XMLin(), and it's that call that takes most of the time.

      The rest of the processing to extract the data doesn't take much time.

      use strict;
      use warnings;
      use XML::Simple;
      use XML::LibXML;

      my $XML_FILE = 'sample.xml';

      # Dump of the XML file into Perl data structures
      my $mldata = XMLin($XML_FILE);

        BTW, use XML::LibXML is superfluous here, as you're not actually using it anywhere in the code...

Re: Parse XML of large size
by almut (Canon) on Dec 01, 2009 at 12:56 UTC

    In case you want to stick with XML::Simple (to avoid having to change existing code), you might want to check which parser XML::Simple is using under the hood  (in case of doubt, print out which modules have been loaded at the end of your script, using print join "\n", sort values %INC;)
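    A minimal self-contained illustration of that check, using only a core module as a stand-in for XML::Simple and friends:

    ```perl
    use strict;
    use warnings;

    # Load something, as your real script would load XML::Simple etc.
    use File::Spec;

    # %INC maps loaded module files to the paths they were loaded from;
    # printing its values at the end of the script shows exactly which
    # parser modules were actually pulled in under the hood.
    print join("\n", sort values %INC), "\n";
    ```

    In your real script, look for XML::Parser, XML::SAX::PurePerl, or XML::LibXML::SAX in that list — which one appears makes a large difference in speed.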

    As you're already using XML::LibXML anyway, you might want to try telling XML::Simple to use it by setting:

    $XML::Simple::PREFERRED_PARSER = "XML::LibXML::SAX";

    (See XML::Simple Environment for the details.)

    As XML::LibXML is known to be one of the fastest XML parsers, this might speed things up (even though XML::Simple would of course still be creating its gazillions of hashes and arrays...)
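    Putting it together, a sketch of the change (the file name is a placeholder, the rest of your extraction code stays the same; requires XML::LibXML and its SAX driver to be installed):

    ```perl
    use strict;
    use warnings;
    use XML::Simple;

    # Ask XML::Simple to use the libxml2-based SAX parser instead of
    # whatever (possibly pure-Perl, slow) parser it would pick by default.
    $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';

    my $mldata = XMLin('sample.xml');
    # ... existing extraction code, unchanged ...
    ```

    Note that this must be set before the call to XMLin(); the resulting data structure is the same, only the parsing underneath is faster.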

Re: Parse XML of large size
by Jenda (Abbot) on Dec 01, 2009 at 16:39 UTC

    XML::Twig may be a good candidate. If you use its simplify() method, the changes relative to the XML::Simple-based code may be fairly small: you just replace the outermost loop with a subroutine definition, and instead of parsing the whole file and then looping, you tell XML::Twig to call that subroutine for each of those tags. Something like (PSEUDOCODE!):

    # before (XML::Simple):
    my $xml = XMLin($file);
    for my $foo (@{ $xml->{foo} }) {
        # and now we process the $foo
    }

    # after (XML::Twig):
    my $twig = XML::Twig->new(
        twig_roots => { 'foo' => \&process_foo },
    );
    $twig->parsefile($file);

    sub process_foo {
        my ($twig, $foo_elt) = @_;
        my $foo = $foo_elt->simplify();
        # and now we process the $foo
        $twig->purge;    # free the memory used by this twig
    }
    (It's apparent that I hadn't used XML::Twig for years :-)

    Another option is to use XML::Rules. It can tweak and simplify the generated structure as the file is parsed and, like XML::Twig, it allows you to execute code once a "twig" (a tag with all subtags and content) is fully parsed. See some of the examples on perlmonks or included with the module.
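    A minimal XML::Rules sketch in the same spirit (the tag name 'foo' and the file name are placeholders; requires XML::Rules from CPAN):

    ```perl
    use strict;
    use warnings;
    use XML::Rules;

    my $parser = XML::Rules->new(
        rules => [
            # keep the text content of leaf tags
            _default => 'content',
            # called once each <foo>...</foo> is fully parsed
            foo => sub {
                my ($tag, $attrs) = @_;
                # process the already-simplified hash in $attrs here
                return;    # return nothing, so the tree is not kept in memory
            },
        ],
    );
    $parser->parsefile('sample.xml');
    ```

    Returning an empty list from the handler is what keeps the memory use flat: the processed twig is discarded instead of being accumulated into one big structure.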

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Parse XML of large size
by Anonymous Monk on Dec 01, 2009 at 11:54 UTC
    Is there any way that I can parse huge XML files without disturbing my actual code written to extract the data?

    No. Depending on what you actually wrote, it shouldn't take too long (< 30 min) to convert to XML::Twig