I would like to relate to the monastery some recent experiments benchmarking how fast XML::Simple parses XML documents when various parsers are plugged in behind it.

Background

There are many ways to parse XML, and Perl provides more ways than probably any other language. I prefer to use XML::Simple where I can, as it allows me to quickly start solving the problem at hand, rather than be distracted by parser issues. That said, XML::Simple has some drawbacks too - it can be painfully slow, and has approximately a billion options. Once you are over the learning curve of which options are usually required, you're still stuck with the speed issue. Hopefully this meditation will help you make some informed choices in that regard.

Motivation

We have a client who provides all their data to us in XML files. We process this XML, and supply the client with a new data format, and other data, all burned onto a shiny CD, every day. The supplied XML is biggish - up to 2 MB, and we might get a hundred such files in a day.
My naive implementation using XML::Simple was too slow - sometimes up to 6 minutes per file. It wasn't really the server's fault - it is an Enterprise-class Sun box with 6 processors and gigs of RAM. Granted it is a busy machine, but performance was pathetic.
One thing holding me back from a wholesale rewrite closer to a low-level parser was that the implementation using XML::Simple was correct, and had taken a huge effort to get there. A rewrite would require re-verification all over again - no thanks.
I decided to see if a way could be found to get better performance whilst keeping our XML::Simple-based implementation.

Preparation and Execution

I read the doco for XML::Simple again, paying special attention to the sections 'SAX Support' and 'Environment'. I then downloaded a number of libraries and modules. Some come with Perl, some with particular OS vendors' distributions of Perl. I recommend you scan your own system(s) to see what you may need. The versions I used were:

expat - C library for parsing XML - 1.95.7
libxml2 - C library for parsing XML - 2.5.4
XML::Parser - perl wrapper around expat - 2.34
XML::LibXML - perl wrapper around libxml2 - 1.58
XML::SAX - perl module that supplies or consumes SAX events - 0.12
XML::SAX::Expat - backend for XML::SAX that uses the Expat library to supply SAX events - 0.37
XML::LibXML::SAX - backend for XML::SAX that uses the libxml2 library to supply SAX events - 1.00
XML::Simple - converts an XML document to a perl hash (roughly) - can have different backends drive this hash creation - 2.09
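
To scan your own system for the Perl modules in the list above, something like the following should do. This is a minimal sketch rather than the script I used: the helper name module_version is made up for illustration, and it only covers the Perl modules, not the expat and libxml2 C libraries themselves.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# 'unknown' if it loads but declares no $VERSION, or nothing at all
# if it cannot be loaded.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # XML::SAX -> XML/SAX.pm
    return unless eval { require $file; 1 };
    my $version = $module->VERSION;
    return defined $version ? $version : 'unknown';
}

for my $module (qw(XML::Parser XML::LibXML XML::SAX
                   XML::SAX::Expat XML::LibXML::SAX XML::Simple)) {
    my $version = module_version($module);
    printf "%-18s %s\n", $module,
        defined $version ? $version : 'not installed';
}
```

Anything reported as 'not installed' is a backend XML::Simple cannot use on that machine, so it is worth running this before reading too much into your own timings.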

<plug mode="on">I won't go into using Perl to do XML parsing via the SAX and DOM paradigms - if you want to know more, come to my talk on this at the OSDC conference.</plug>

Also, I used the following code - note it isn't like most benchmark code you see in the monastery - for a start it doesn't 'use Benchmark;'. This wasn't necessary as we are measuring an operation that takes tens of seconds, not micro- and milliseconds. The differences in speed stand out without the help of the Benchmark module.
This code is pretty ugly too, but I believe it doesn't need a lot of trimming/reshaping. I was careful to make sure the measurements are tightly wrapped around the method/function calls, so as not to artificially inflate durations.

#!/usr/bin/perl -w

use strict;

use XML::Simple qw(:strict);
use Time::HiRes qw(time);
use File::stat;
use Test::More qw(no_plan);

my $xs;
my $XMLFile = $ARGV[0];
my $size = stat($XMLFile)->size();
print "File $XMLFile is " . $size . " bytes\n";

my ($start, $end);

$xs = XML::Simple->new(ForceArray => 0,
               KeyAttr    => {});

my $backend = '';
my $xml_default;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;

    $start = time();
    $xml_default = $xs->XMLin($XMLFile);
    $end = time();
}

print_result($backend, $end, $start, $size);

$backend = 'XML::Parser';
my $xml_x_p;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;

    $start = time();
    $xml_x_p = $xs->XMLin($XMLFile);
    $end = time();
}

print_result($backend, $end, $start, $size);

$backend = 'XML::SAX::Expat';
my $xml_x_s_e;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;

    $start = time();
    $xml_x_s_e = $xs->XMLin($XMLFile);
    $end = time();
}

print_result($backend, $end, $start, $size);

$backend = 'XML::LibXML::SAX';
my $xml_x_l_s;
{
    local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $backend;

    $start = time();
    $xml_x_l_s = $xs->XMLin($XMLFile);
    $end = time();
}

print_result($backend, $end, $start, $size);

is_deeply($xml_default, $xml_x_p);
is_deeply($xml_default, $xml_x_s_e);
is_deeply($xml_default, $xml_x_l_s);

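sub print_result {
    my ($backend, $end, $start, $size) = @_;

    # An empty backend name means XML::Simple chose its own default parser.
    my $label    = $backend eq '' ? 'default' : $backend;
    my $duration = $end - $start;

    print "XML::Simple with $label backend took ",
          sprintf("%02.4f", $duration), " seconds. ";

    # $size is in bytes, so divide by 1024 to report kilobytes per second.
    print "This equates to ", sprintf("%02.4f", $size / 1024 / $duration),
          " kilobytes per second (1024 bytes per k)\n";
}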

Notice the is_deeply() calls from Test::More - these confirm that the different backends all cause XML::Simple to generate the same data structure.

Results

[le6303@itdevtst perl]$ perl xml.pl bigxml.xml
File bigxml.xml is 1730463 bytes
XML::Simple with default backend took 12.9769 seconds. This equates to 133349.4084 kilobytes per second (1024 bytes per k)
XML::Simple with XML::Parser backend took 3.6010 seconds. This equates to 480549.2074 kilobytes per second (1024 bytes per k)
XML::Simple with XML::SAX::Expat backend took 13.6003 seconds. This equates to 127237.2038 kilobytes per second (1024 bytes per k)
XML::Simple with XML::LibXML::SAX backend took 6.3547 seconds. This equates to 272310.8906 kilobytes per second (1024 bytes per k)
ok 1
ok 2
ok 3
1..3

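So XML::Parser 'wins', cutting the runtime to roughly 26-28% of its slower cousins, the default and XML::SAX::Expat backends (3.6 seconds against 13.0 and 13.6). XML::LibXML::SAX lands in between at 6.4 seconds. On our platforms here we are actually seeing reductions to around 10% - mission accomplished. One caveat on the output: the figures labelled 'kilobytes per second' are really bytes per second, because the run above divided the file size in bytes by the duration without scaling by 1024.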
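
If you want to check the arithmetic on the run above, the sketch below recomputes each backend's runtime as a percentage of the slowest run and converts the throughput to kilobytes per second. The file size and durations are copied from the output; the helper names kib_per_sec and percent_of are mine, and I am assuming the reported throughput was simply file size in bytes divided by elapsed seconds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Figures copied from the benchmark output above.
my $size = 1730463;    # bytes
my @runs = (
    [ 'default',          12.9769 ],
    [ 'XML::Parser',       3.6010 ],
    [ 'XML::SAX::Expat',  13.6003 ],
    [ 'XML::LibXML::SAX',  6.3547 ],
);

# Throughput in kilobytes (1024 bytes) per second.
sub kib_per_sec {
    my ($bytes, $seconds) = @_;
    return $bytes / 1024 / $seconds;
}

# One duration expressed as a percentage of another.
sub percent_of {
    my ($seconds, $baseline) = @_;
    return 100 * $seconds / $baseline;
}

# Use the slowest run as the baseline for the comparison.
my ($slowest) = sort { $b <=> $a } map { $_->[1] } @runs;

for my $run (@runs) {
    my ($name, $seconds) = @$run;
    printf "%-18s %8.1f KB/s  %5.1f%% of slowest\n",
        $name, kib_per_sec($size, $seconds), percent_of($seconds, $slowest);
}
```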

Analysis

Tracing the code and comparing it to the doco reveals:

XML::Simple with the default backend and XML::Simple with the XML::SAX::Expat backend are actually the same operation.
If you do not have XML::SAX installed, XML::Simple with the default backend and XML::Simple with the XML::Parser backend are actually the same operation.
If you have both XML::SAX and XML::Parser installed, it would seem best to enable XML::Parser as the preferred parser, probably via the envvar. If a developer has a particular need of a SAX parser, he can override it via the package variable $XML::Simple::PREFERRED_PARSER - it has a higher priority than the envvar.
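
The two override mechanisms from the last point can be sketched as follows. Treat it as an illustration under assumptions, not our production code: the sample document and the helper name parse_with are made up, the envvar would normally be set in the shell or service environment rather than in the script, and I use XML::SAX::PurePerl as the SAX parser only because it ships with XML::SAX.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# Site-wide preference. Normally exported in the environment; set here
# only to keep the sketch self-contained.
$ENV{XML_SIMPLE_PREFERRED_PARSER} = 'XML::Parser';

my $xs = XML::Simple->new(ForceArray => 0, KeyAttr => {});

# Hypothetical helper: parse with a specific backend for one call only.
# The package variable outranks the envvar, and local() restores the
# previous value when the sub returns.
sub parse_with {
    my ($backend, $xml) = @_;
    local $XML::Simple::PREFERRED_PARSER = $backend;
    return $xs->XMLin($xml);
}

my $sample = '<order><id>42</id><customer>acme</customer></order>';

my $via_env = $xs->XMLin($sample);                        # envvar decides
my $via_pkg = parse_with('XML::SAX::PurePerl', $sample);  # override wins

print "id via envvar:      $via_env->{id}\n";
print "id via package var: $via_pkg->{id}\n";
```

Using local() keeps the override scoped to the one call that needs a SAX parser, so everything else in the process still gets the faster XML::Parser path.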

Update 16:53 23 Nov 04 Just for comparison, running XML::Parser alone on the same file in 'Subs' style takes, on average, 2.1 seconds.
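
For anyone who has not met the 'Subs' style: XML::Parser calls a sub named after each element it opens, looked up in the package given by the Pkg option. The sketch below is not the code behind the 2.1 second figure; the TagCounter package, the count_items helper and the sample document are all invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical handler package for the 'Subs' style.
package TagCounter;

our %seen;

# Called for every <item> start tag with the Expat object, the element
# name, and then the attribute name/value pairs.
sub item {
    my ($expat, $element, %attr) = @_;
    $seen{$element}++;
}

package main;

# Count the <item> elements in an XML string.
sub count_items {
    my ($xml) = @_;
    %TagCounter::seen = ();
    my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'TagCounter');
    $parser->parse($xml);
    return $TagCounter::seen{item} || 0;
}

print count_items('<doc><item id="1"/><item id="2"/><other/></doc>'), "\n";
```

Elements without a matching sub (doc and other here) are simply skipped, which is part of why this style is quick: you pay only for the tags you care about, and no hash of the whole document is ever built.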

use brain;

In reply to XML::Simple Benchmarks with various backends by leriksen
