wardy3 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. Hope someone can point me in the right direction, please.

I am trying to write an XML parser for a large-ish XML file. It's about 800,000 lines long but can get bigger (or smaller). The format of the file looks like:

    <?xml version="1.0" standalone="yes" ?>
    <abc>
      <def>
        <ghi>
          <jkl>information</jkl>
        </ghi>
        <important>
          <data>
            <info1>info 1</info1>
            <info2>info 2</info2>
            <info4>
              <data>data 4</data>
            </info4>
          </data>
          <data>
            <info1>info</info1>
            <info2>info</info2>
          </data>
        </important>
      </def>
    </abc>

Note the duplicate tag "data" that has a different path

I have used XML::Parser before for parsing a similar file but thought I'd explore the world of the latest parsers. So far, I have tried XML::SAX (using XML::SAX::ExpatXS), XML::Twig and - just now - XML::Rules. I believe I need a stream parser because of the size, even though I need the path sometimes.

All these parsers seem significantly slower than the old-fashioned XML::Parser. I was a bit surprised at this given the age of it. My sample scripts are below, with timings. I'm running on cygwin using a Dell Latitude D610.

Should I move with the times, do you think?

Thanks, Michael

Here are some of the scripts I've been using to benchmark the parsers. Please note that I haven't put all the complications in here; I'm just trying to get a base idea of how fast they can get through the file.

If I've missed something or done something a weird way, I'd appreciate the feedback. I'm tracking the element path as I need to use it in the real version.

SAX

Takes about 1 minute

    use strict;
    use XML::SAX;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => symdevHandler->new
    );

    open my $symdev_list, '935.xml';
    $parser->parse_file($symdev_list);
    close $symdev_list;

    package symdevHandler;
    use base qw(XML::SAX::Base);

    my ( $dev );
    my @element_stack;
    my %conf_of;

    sub start_element {
        my ($self, $el) = @_;
        push @element_stack, $el->{Name};
    }

    sub end_element {
        my ($self, $el) = @_;
        pop @element_stack;
    }

    sub characters {
        my ($self, $el) = @_;
        my $text = $el->{Data};
        my $in_element = $element_stack[-1];
        if ( $in_element eq 'dev_name' ) {
            $dev = $text;
        }
        elsif ( $in_element eq 'configuration' ) {
            $conf_of{$dev} = $text;
        }
    }

    sub end_document {
        my ($self, $el) = @_;
        for my $key ( keys %conf_of ) {
            print "$key\t$conf_of{$key}\n";
        }
    }

    1;

XML::Parser

Takes about 40 seconds

    use strict;
    use XML::Parser;

    # Set up the XML parser to point to standard symdev processing subroutines
    my $parser = XML::Parser->new(
        Handlers => {
            Start => \&symdev_start,
            End   => \&symdev_end,
            Final => \&symdev_final,
            Char  => \&symdev_char,
        }
    );

    open my $symdev_list, '935.xml';
    $parser->parse($symdev_list);
    close $symdev_list;

    {
        my ( $dev, $text );
        my @element_stack;
        my %conf_of;

        sub symdev_start {
            my ( $expat, $name, %atts ) = @_;
            push @element_stack, $name;
            $text = '';
        }

        sub symdev_end {
            my ( $expat, $name, %atts ) = @_;
            pop @element_stack;
            if ( $name eq 'dev_name' ) {
                $dev = $text;
            }
            elsif ( $name eq 'configuration' ) {
                $conf_of{$dev} = $text;
            }
        }

        sub symdev_char {
            my ( $expat, $string ) = @_;
            $text .= $string;
        }

        sub symdev_final {
            my ( $expat, $name, %atts ) = @_;
            for my $key ( keys %conf_of ) {
                print "$key\t$conf_of{$key}\n";
            }
        }
    }

XML::Rules

Takes about 90 seconds but isn't doing much work

    use strict;
    use warnings;
    use XML::Rules;

    open my $symdev_list, '935.xml';

    my @rules = (
        dev_name => sub { my $dev = $_[1]->{_content} },
    );

    my $parser = XML::Rules->new(rules => \@rules);
    $parser->parse( $symdev_list);
    close $symdev_list;

Re: Which XML parser would be the wisest to use
by runrig (Abbot) on Feb 21, 2008 at 01:12 UTC
    I'm not sure what XML::Rules is using under the hood, but how do you know XML::SAX is using XML::SAX::ExpatXS? It might be using the PurePerl Parser, and so would be extremely slow. The alternative to XML::Parser (which uses expat) is XML::LibXML (which uses libxml) at the low level. Everything else is just wrappers around those, and bound to be 'slower', but possibly easier to use for your specific problem.

    Update: XML::Rules uses XML::Parser::Expat...but since it's a wrapper around Expat, it would be slower than the XML::Parser::Expat module alone...but faster/slower would not be the point of using XML::Rules over XML::Parser::Expat in the first place.

    When I ran your XML::SAX code, it used XML::LibXML (via XML::LibXML::SAX) under the hood, but then, I have XML::LibXML installed, and ParserDetails.ini is set to use it.

      Thanks, runrig

      I actually set $XML::SAX::ParserPackage in my SAX test and gave each of the modules a go. ExpatXS was the fastest, so I just used it.
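For anyone wanting to repeat that comparison, a minimal sketch of pinning the SAX driver and confirming which one the factory hands back might look like the following. It assumes only that XML::SAX is installed; XML::SAX::PurePerl is used here because it ships with XML::SAX itself, and the variable names `@known` and `$driver` are made up for illustration.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::Base;
use XML::SAX::ParserFactory;

# List the drivers registered in ParserDetails.ini (may be empty
# if the file was never written on this machine).
my @known = map { $_->{Name} } @{ XML::SAX->parsers };
print "registered: @known\n";

# Pin one driver for this run. Swap in 'XML::SAX::ExpatXS' or
# 'XML::LibXML::SAX' here if those modules are installed.
$XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $parser = XML::SAX::ParserFactory->parser(
    Handler => XML::SAX::Base->new,    # do-nothing handler, for the check only
);

# The class of the returned object is the driver actually in use.
my $driver = ref $parser;
print "using: $driver\n";
```

Printing `ref $parser` before timing a run is a cheap way to rule out the "it silently fell back to PurePerl" case that runrig raised.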

      I thought XML::LibXML was a tree parser and I have had little luck with them as I run out of memory and my windoze session grinds to a halt :-(

      I quite like the feel of SAX and might just put up with the penalty but I wanted to get some opinions before just ploughing ahead and coding all the parsers I need.

      BTW, PurePerl took a lot longer. Here's the original run where I timed a few of the SAX modules :)

      Parser                      real         user         sys
      XML::LibXML::SAX            1m9.658s     1m3.186s     0m0.312s
      XML::SAX::Expat             2m29.873s    2m2.421s     0m0.389s
      XML::SAX::ExpatXS           0m55.370s    0m49.014s    0m0.311s
      XML::LibXML::SAX::Parser    2m31.700s    2m14.342s    0m0.483s
      XML::SAX::PurePerl          5m2.766s     4m23.733s    0m0.515s

        XML::LibXML can be a purely SAX parser (XML::LibXML::SAX) if no DOM functions are used. XML::LibXML::SAX::Parser on the other hand says that it builds the DOM and then generates SAX events.
Re: Which XML parser would be the wisest to use
by mirod (Canon) on Feb 21, 2008 at 05:45 UTC

    Can your data fit in an XML::LibXML DOM (i.e. XML::LibXML in tree mode)? If it does, then go for it. That will be the fastest you can get in Perl.
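As a rough illustration of the tree-mode route, here is a minimal sketch that loads the sample document from the question into a DOM and uses XPath to pick out only the "data" elements directly under "important". It assumes a reasonably recent XML::LibXML (one that provides `load_xml`); the `@records` structure and its key names are invented for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<?xml version="1.0" standalone="yes" ?>
<abc>
  <def>
    <ghi><jkl>information</jkl></ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4><data>data 4</data></info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
  </def>
</abc>
XML

# Builds the whole tree in memory, which is the cost of this approach.
my $doc = XML::LibXML->load_xml(string => $xml);

my @records;
# The absolute path skips the nested <data> under <info4>.
for my $data ($doc->findnodes('/abc/def/important/data')) {
    push @records, {
        info1  => $data->findvalue('info1'),
        info2  => $data->findvalue('info2'),
        nested => $data->findvalue('info4/data'),   # '' when absent
    };
}

printf "%s | %s | %s\n", @{$_}{qw(info1 info2 nested)} for @records;
```

For a real file, `load_xml(location => $filename)` would replace the string argument; whether an 800,000-line document fits in memory this way is exactly the question to test first.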

    If not, XML::Parser is probably the fastest you can get: XML::Twig and XML::Rules are based on it, and are usually slower. Surprisingly, I found that SAX, whether based on XML::Parser or on XML::LibXML, is very slow (see the end of Simple Perl XML Benchmark).
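If the streaming route with XML::Parser is the one taken, the path requirement from the question (the same "data" tag under two different parents) could be handled with expat's own `context` method rather than a hand-kept stack. The following is only a sketch against the sample document; `%text_at` and `$buffer` are hypothetical names, and the whitespace filter is an assumption about what counts as useful text.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<?xml version="1.0" standalone="yes" ?>
<abc>
  <def>
    <ghi><jkl>information</jkl></ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4><data>data 4</data></info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
  </def>
</abc>
XML

my %text_at;        # full element path => list of text values seen there
my $buffer = '';    # text collected since the last start or end tag

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { $buffer = '' },
        Char  => sub { my ($expat, $string) = @_; $buffer .= $string },
        End   => sub {
            my ($expat, $name) = @_;
            # Inside an End handler, context() lists the ancestors that
            # are still open, so the closing element's name is appended.
            my $path = join '/', $expat->context, $name;
            push @{ $text_at{$path} }, $buffer if $buffer =~ /\S/;
            $buffer = '';
        },
    },
);
$parser->parse($xml);

for my $path (sort keys %text_at) {
    print "$path: @{ $text_at{$path} }\n";
}
```

Keying on the full path keeps `important/data/info4/data` apart from `important/data` without any extra bookkeeping in the handlers.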

      Thanks, Mirod!

      That's a great table. I noticed XSLT did well in the "extract text from elements" test.

      I had thought about XSLT but assumed it'd be too slow. I tried to learn it a few years ago and maybe it's time to re-visit.

        The thing is, XML::LibXSLT will load the entire document in memory. And as it is based on libxml2, just like XML::LibXML, it probably needs about the same amount of space as XML::LibXML.

        An alternative solution that I forgot to mention, mostly because I have never tried it and I don't even know if XML::LibXML supports it: libxml2 has a pull mode that you might be able to use to lower the memory requirements of your code (by deleting things you don't use any more in your DOM). If you go that route, it'd be interesting if you could describe how it works, because that could be a good alternative to SAX when processing huge documents.
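For what it is worth, later XML::LibXML releases expose libxml2's pull interface as XML::LibXML::Reader. A minimal sketch of the idea, assuming that module is available and using the sample document from the question, could look like this; the depth value and the `@records` layout are specific to this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

my $xml = <<'XML';
<?xml version="1.0" standalone="yes" ?>
<abc>
  <def>
    <ghi><jkl>information</jkl></ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4><data>data 4</data></info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
  </def>
</abc>
XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

my @records;
# nextElement returns 1 while another <data> start tag is found.
while ($reader->nextElement('data') > 0) {
    # abc=0, def=1, important=2, data=3: this skips the nested
    # <data> under <info4>, which sits deeper in the tree.
    next unless $reader->depth == 3;

    # Build a small DOM fragment for this one record only.
    my $node = $reader->copyCurrentNode(1);
    push @records, {
        info1 => $node->findvalue('info1'),
        info2 => $node->findvalue('info2'),
    };
}

print "$_->{info1} | $_->{info2}\n" for @records;
```

Only one record's subtree is turned into DOM nodes at a time, so memory use should track the size of a record rather than the size of the file; for a file on disk, `location => $filename` would replace the `string` argument.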

Re: Which XML parser would be the wisest to use
by Skeeve (Parson) on Feb 21, 2008 at 10:35 UTC
    The wisest parser to use is the one that fits your needs best. If all fit, use the one you understand the best.

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Thanks, Skeeve. Very Zen.

      After my long journey, I agree. I really wanted speed and that's what I'm going for :-)