Which XML parser would be the wisest to use

wardy3 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks. I hope someone can point me in the right direction, please.

I am trying to write an XML parser for a large-ish XML file. It's about 800,000 lines long but can get bigger (or smaller). The format of the file looks like:

<?xml version="1.0" standalone="yes" ?>

<abc>
  <def>
    <ghi>
      <jkl>information</jkl>
    </ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4>
            <data>data 4</data>
        </info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
   </def>
</abc>

Note the duplicate tag "data", which appears at two different paths.

I have used XML::Parser before for parsing a similar file but thought I'd explore the world of the latest parsers. So far, I have tried XML::SAX (using XML::SAX::ExpatXS), XML::Twig and - just now - XML::Rules. I believe I need a stream parser because of the size, even though I need the path sometimes.

All these parsers seem significantly slower than the old-fashioned XML::Parser. I was a bit surprised at this given the age of it. My sample scripts are below, with timings. I'm running on cygwin using a Dell Latitude D610.

Should I move with the times, do you think?

Thanks, Michael

Here's some of the scripts I've been using to benchmark the parsers. Please note that I haven't put all the complications in here, just trying to get a base idea of how fast they can get through the file.

If I've missed something or done something a weird way, I'd appreciate the feedback. I'm tracking the element path as I need to use it in the real version.
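To illustrate why the path matters for the duplicate "data" tag, here is a minimal sketch that joins the element stack into a path and keys the text on it. It runs XML::Parser over a cut-down copy of the sample document above; the %text_at hash and the "/"-joined path format are assumptions made for illustration and are not part of the benchmark scripts.

```perl
use strict;
use warnings;
use XML::Parser;

# Cut-down copy of the sample document from the question.
my $xml = <<'XML';
<abc><def><important>
  <data>
    <info1>info 1</info1>
    <info4><data>data 4</data></info4>
  </data>
</important></def></abc>
XML

my ( @element_stack, %text_at );

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ( $expat, $name ) = @_; push @element_stack, $name },
        End   => sub { pop @element_stack },
        Char  => sub {
            my ( $expat, $string ) = @_;
            return unless $string =~ /\S/;    # skip whitespace between elements
            # Char can fire more than once per text run, so append.
            $text_at{ join '/', @element_stack } .= $string;
        },
    },
);
$parser->parse($xml);

print "$_ => $text_at{$_}\n" for sort keys %text_at;
```

The inner "data" element ends up under a key ending in info4/data, so it cannot be confused with the outer one even though the element name is the same.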

SAX

Takes about 1 minute

use strict;
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => symdevHandler->new
);

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse_file($symdev_list);
close $symdev_list;

package symdevHandler;
use base qw(XML::SAX::Base);

my ( $dev );
my @element_stack;
my %conf_of;

sub start_element {
    my ($self, $el) = @_;

    push @element_stack, $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;

    pop @element_stack;
}

sub characters {
    my ($self, $el) = @_;

    my $text = $el->{Data};

    my $in_element = $element_stack[-1];

    if ( $in_element eq 'dev_name' ) {
        $dev = $text;
    }
    elsif ( $in_element eq 'configuration' ) {
        $conf_of{$dev} = $text;
    }

}

sub end_document {
    my ($self, $el) = @_;

    for my $key ( keys %conf_of ) {
        print "$key\t$conf_of{$key}\n";
    }
}

1;

XML::Parser

Takes about 40 seconds

use strict;
use XML::Parser;
# Set up the XML parser to point to standard symdev processing subroutines
my $parser = XML::Parser->new(
    Handlers => {
        Start => \&symdev_start,
        End   => \&symdev_end,
        Final => \&symdev_final,
        Char  => \&symdev_char,
    }
);

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse($symdev_list);
close $symdev_list;

{
    my ( $dev, $text );
    my @element_stack;
    my %conf_of;

    sub symdev_start {
        my ( $expat, $name, %atts ) = @_;

        push @element_stack, $name;

        $text = '';
    }
    sub symdev_end {
        my ( $expat, $name, %atts ) = @_;

        pop @element_stack;

        if ( $name eq 'dev_name' ) {
            $dev = $text;
        }
        elsif ( $name eq 'configuration' ) {
            $conf_of{$dev} = $text;
        }
    }

    sub symdev_char {
        my ( $expat, $string ) = @_;

        $text .= $string;
    }

    sub symdev_final {
        my ( $expat, $name, %atts ) = @_;

        for my $key ( keys %conf_of ) {
            print "$key\t$conf_of{$key}\n";
        }
    }
}

XML::Rules

Takes about 90 seconds but isn't doing much work

use strict;
use warnings;

use XML::Rules;

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";

my @rules = (
    dev_name => sub{ my $dev = $_[1]->{_content} },
);


my $parser = XML::Rules->new(rules => \@rules);
$parser->parse( $symdev_list);

close $symdev_list;

Re: Which XML parser would be the wisest to use by runrig (Abbot) on Feb 21, 2008 at 01:12 UTC
I'm not sure what XML::Rules is using under the hood, but how do you know XML::SAX is using XML::SAX::ExpatXS? It might be using the PurePerl parser, which would be extremely slow. The low-level alternative to XML::Parser (which uses expat) is XML::LibXML (which uses libxml2). Everything else is just a wrapper around one of those, and so bound to be slower, but possibly easier to use for your specific problem.

Update: XML::Rules uses XML::Parser::Expat. Since it's a wrapper around Expat, it will be slower than XML::Parser::Expat alone, but speed would not be the reason to choose XML::Rules over XML::Parser::Expat in the first place.

When I ran your XML::SAX code, it used XML::LibXML (via XML::LibXML::SAX) under the hood, but then I have XML::LibXML installed and ParserDetails.ini is set to use it.
Re^2: Which XML parser would be the wisest to use by wardy3 (Scribe) on Feb 21, 2008 at 01:32 UTC
Thanks, runrig.

I actually set $XML::SAX::ParserPackage in my SAX test and gave each of the modules a go. ExpatXS was the fastest, so I just used it.

I thought XML::LibXML was a tree parser, and I have had little luck with those: I run out of memory and my windoze session grinds to a halt :-(

I quite like the feel of SAX and might just put up with the penalty, but I wanted to get some opinions before ploughing ahead and coding all the parsers I need.

BTW, PurePerl took a lot longer. Here's the original run where I timed a few of the SAX modules :)

Module                     real        user        sys
XML::LibXML::SAX           1m9.658s    1m3.186s    0m0.312s
XML::SAX::Expat            2m29.873s   2m2.421s    0m0.389s
XML::SAX::ExpatXS          0m55.370s   0m49.014s   0m0.311s
XML::LibXML::SAX::Parser   2m31.700s   2m14.342s   0m0.483s
XML::SAX::PurePerl         5m2.766s    4m23.733s   0m0.515s
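For readers who want to repeat the comparison, the following is a minimal sketch of pinning the SAX driver through $XML::SAX::ParserPackage, so that ParserDetails.ini is not consulted. The helper name parser_for is hypothetical, and the no-op XML::SAX::Base handler stands in for the real symdevHandler. XML::SAX::PurePerl is used here only because it ships with XML::SAX; substitute 'XML::SAX::ExpatXS' or 'XML::LibXML::SAX' if those modules are installed.

```perl
use strict;
use warnings;
use XML::SAX;          # also loads XML::SAX::ParserFactory
use XML::SAX::Base;

# Hypothetical helper: build a parser from a named SAX driver,
# bypassing whatever ParserDetails.ini would otherwise select.
sub parser_for {
    my ( $driver, $handler ) = @_;
    local $XML::SAX::ParserPackage = $driver;
    return XML::SAX::ParserFactory->parser( Handler => $handler );
}

# A bare XML::SAX::Base object accepts every event and does nothing.
my $parser = parser_for( 'XML::SAX::PurePerl', XML::SAX::Base->new );

# Show which class was actually instantiated.
print ref($parser), "\n";
```

Printing the class of the returned parser is a cheap way to confirm which driver a benchmark run really used.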
Re^3: Which XML parser would be the wisest to use by runrig (Abbot) on Feb 21, 2008 at 04:28 UTC
XML::LibXML can be a purely SAX parser (XML::LibXML::SAX) if no DOM functions are used. XML::LibXML::SAX::Parser, on the other hand, says that it builds the DOM and then generates SAX events.
Re^4: Which XML parser would be the wisest to use by Jenda (Abbot) on Feb 29, 2008 at 00:30 UTC
Re: Which XML parser would be the wisest to use by mirod (Canon) on Feb 21, 2008 at 05:45 UTC
Can your data fit in an XML::LibXML DOM (i.e. XML::LibXML in tree mode)? If it does, then go for it. That will be the fastest you can get in Perl.

If not, XML::Parser is probably the fastest you can get: XML::Twig and XML::Rules are based on it, and are usually slower.

Surprisingly, I found that SAX, whether based on XML::Parser or on XML::LibXML, is very slow (see the end of Simple Perl XML Benchmark).
Re^2: Which XML parser would be the wisest to use by wardy3 (Scribe) on Feb 21, 2008 at 06:05 UTC
Thanks, Mirod! That's a great table. I noticed XSLT did well at extracting text from elements. I had thought about XSLT but assumed it'd be too slow. I tried to learn it a few years ago, and maybe it's time to revisit it.
Re^3: Which XML parser would be the wisest to use by mirod (Canon) on Feb 21, 2008 at 07:49 UTC
The thing is, XML::LibXSLT will load the entire document in memory. And as it is based on `libxml2`, just like XML::LibXML, it probably needs about the same amount of space as XML::LibXML.

An alternative solution that I forgot to mention, mostly because I have never tried it and I don't even know whether XML::LibXML supports it: `libxml2` has a pull mode that you might be able to use to lower the memory requirements of your code (by deleting things you no longer use from your DOM). If you go that route, it'd be interesting if you could describe how it works, because that could be a good alternative to SAX when processing huge documents.
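As a rough idea of what a pull-mode version could look like: XML::LibXML does expose libxml2's pull parser as XML::LibXML::Reader. The sketch below rests on several assumptions. The document layout (a devices/device wrapper around dev_name and configuration) and the sample values are hypothetical, since the thread never shows where those elements sit, and the function name conf_by_dev is made up for the example. It has not been timed against the scripts in the question.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical layout and values, for illustration only.
my $xml = <<'XML';
<devices>
  <device><dev_name>0001</dev_name><configuration>RAID-5</configuration></device>
  <device><dev_name>0002</dev_name><configuration>2-Way Mir</configuration></device>
</devices>
XML

sub conf_by_dev {
    my (%source) = @_;    # string => $xml, or location => '935.xml'
    my $reader = XML::LibXML::Reader->new(%source)
        or die "Cannot create reader\n";

    my ( $dev, %conf_of );
    while ( $reader->read == 1 ) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        my $name = $reader->name;
        next unless $name eq 'dev_name' || $name eq 'configuration';

        # Expand only this small leaf element, never the whole document.
        my $text = $reader->copyCurrentNode(1)->textContent;
        if   ( $name eq 'dev_name' ) { $dev            = $text }
        else                         { $conf_of{$dev} = $text }
    }
    return \%conf_of;
}

my $conf_of = conf_by_dev( string => $xml );
print "$_\t$conf_of->{$_}\n" for sort keys %$conf_of;
```

Because only the two leaf elements are ever copied into DOM nodes, memory use should stay roughly flat as the file grows, which is the property being asked about here.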
Re^4: Which XML parser would be the wisest to use by wardy3 (Scribe) on Feb 28, 2008 at 07:47 UTC
Re: Which XML parser would be the wisest to use by Skeeve (Parson) on Feb 21, 2008 at 10:35 UTC
The wisest parser to use is the one that fits your needs best. If all fit, use the one you understand the best.
Re^2: Which XML parser would be the wisest to use by wardy3 (Scribe) on Feb 28, 2008 at 07:50 UTC
Thanks, Skeeve. Very Zen. After my long journey, I agree. I really wanted speed and that's what I'm going for :-)