XML::Simple hangs

Ovid has asked for the wisdom of the Perl Monks concerning the following question:


Re: XML::Simple hangs
by mifflin (Curate) on Jun 04, 2005 at 06:50 UTC

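use strict;
use warnings;
use XML::SAX;

# Stream the document through a SAX parser and report each event,
# rather than building a tree in memory.
my $h = Handler->new();
my $p = XML::SAX::ParserFactory->parser(Handler => $h)
    or die "Unable to get XML SAX parser object";
$p->parse_uri("10.xml");

# Minimal SAX handler that accumulates character data between tags.
BEGIN {
package Handler;

sub new {
    my $class = shift;
    bless({chars => ''}, ref $class || $class);
}

# Reset the buffer and announce each opening tag.
sub start_element {
    my ($this, $data) = @_;
    $this->{chars} = '';
    print "start $data->{Name}\n";
}

# Report how many characters were collected since the last tag.
sub end_element {
    my ($this, $data) = @_;
    print "end $data->{Name} = ", length($this->{chars}), "\n";
    $this->{chars} = '';
}

# May be called many times for one text node; append each chunk.
sub characters  {
    my ($this, $data) = @_;
    my $chars = $data->{Data};
    $this->{chars} .= $chars;
    print "characters ",length($chars),"\n";
}
}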

Re: XML::Simple hangs
by grantm (Parson) on Jun 05, 2005 at 05:13 UTC

'Hang' is such an ugly word :-) On my Athlon 2600 laptop it took nearly 2 hours to complete, but it did eventually finish.

I'm guessing that on your system XML::Simple is using XML::Parser. Internally, XML::Simple does something like this:

  my $xp = XML::Parser->new(Style => 'Tree');
  my $tree = $xp->parse($document);

and then proceeds to reduce $tree into something simpler.

If you try that snippet on your document, then I suspect you'll see similarly long processing times. I'm not sure why expat is so slow processing your XML, but it is rather 'unusual' XML.

Your original document had over 100,000 lines. I benchmarked passing shorter versions of the document through different parser modules using XML::Simple. (If you install XML::SAX, you can choose alternative parsers by assigning a parser module name to $XML::Simple::PREFERRED_PARSER.) I would caution against reading anything at all into these results in a general sense, but in the case of your specific data they are interesting:

Parser (run time in seconds)	5000 lines	10000 lines	15000 lines	20000 lines	25000 lines
XML::Parser	8	40	92	169	267
XML::SAX::Expat	8	40	95	173	272
XML::SAX::ExpatXS	4	27	66	121	191
XML::SAX::PurePerl	9	20	29	39	49
XML::LibXML::SAX::Parser	1	1	1	1	1

As you can see, the run times for all the expat based parsers are increasing exponentially with the size of the file. The PurePerl parser is coping amazingly well and its run times are only increasing ~~geometrically~~ linearly. The libxml based SAX parser is the clear winner in this race with run times so short I can't see an increase at all.
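
For a rough check on how those run times scale, the sketch below fits a straight line to log(time) against log(lines) using the figures from the table. The subroutine name loglog_slope is made up for this illustration, and five data points per parser are far too few to prove anything. A slope near 1 would point to linear growth and a slope near 2 to quadratic growth; true exponential growth would not settle on any fixed slope.

```perl
use strict;
use warnings;

# Run times copied from the table above (seconds, per the later text).
my @lines = (5000, 10000, 15000, 20000, 25000);
my %times = (
    'XML::Parser'        => [8, 40, 92, 169, 267],
    'XML::SAX::ExpatXS'  => [4, 27, 66, 121, 191],
    'XML::SAX::PurePerl' => [9, 20, 29, 39, 49],
);

# Least-squares slope of log(time) against log(lines).
# (Hypothetical helper, not part of any XML module.)
sub loglog_slope {
    my ($xs, $ys) = @_;
    my @lx = map { log $_ } @$xs;
    my @ly = map { log $_ } @$ys;
    my $n  = @lx;

    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @lx;
    $mean_y += $_ / $n for @ly;

    my ($sxy, $sxx) = (0, 0);
    for my $i (0 .. $n - 1) {
        $sxy += ($lx[$i] - $mean_x) * ($ly[$i] - $mean_y);
        $sxx += ($lx[$i] - $mean_x) ** 2;
    }
    return $sxy / $sxx;
}

for my $parser (sort keys %times) {
    printf "%-20s slope %.2f\n", $parser, loglog_slope(\@lines, $times{$parser});
}
```

On these numbers the PurePerl row comes out at roughly 1 and the expat rows somewhat above 2, so "quadratic" may describe the expat behaviour better than "exponential", with the caveat that the sample is tiny.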

Just to illustrate how unusual your XML is compared to 'normal' XML, I changed the 20,000 line file so that instead of looking like this...

<data>
SUQzAgAAAAAQaVRUMgAAEQBrZWVwIFlhIEhlYWQgVXAAVFAxAAAUAEhhcmxlbSB0aG
cwBUQUwAABQARnJlZSBBZ2VudC9EZWMuIDA0AFRSSwAABQAzLzMAVEVOAAANAGlUdW
AENPTQAAaABlbmdpVHVuTk9STQAgMDAwMDA0NUEgMDAwMDAwMDMgMDAwMDMwRjAgMD
MDAwMENBNTggMDAwMkNDNkQgMDAwMDgyM0MgMDAwMDgwMEQgMDAwNDAxODIgMDAwND
...
</data>

it looked like this ...

<data>
<line>SUQzAgAAAAAQaVRUMgAAEQBrZWVwIFlhIEhlYWQgVXAAVFAxAAAUAEhhcmxlbSB0aG</line>
<line>cwBUQUwAABQARnJlZSBBZ2VudC9EZWMuIDA0AFRSSwAABQAzLzMAVEVOAAANAGlUdW</line>
<line>AENPTQAAaABlbmdpVHVuTk9STQAgMDAwMDA0NUEgMDAwMDAwMDMgMDAwMDMwRjAgMD</line>
<line>MDAwMENBNTggMDAwMkNDNkQgMDAwMDgyM0MgMDAwMDgwMEQgMDAwNDAxODIgMDAwND</line>
...
</data>

You might imagine that introducing all those extra <line> tags, and the extra structure they imply, would give the parsers more work to do and therefore make them slower. But here are the timings for processing the modified version:

Parser (run time in seconds)	20000 lines
XML::Parser	2
XML::SAX::Expat	7
XML::SAX::ExpatXS	5
XML::SAX::PurePerl	73
XML::LibXML::SAX::Parser	7

Yes, you read that right, the run time using XML::Parser dropped from 169 seconds to 2! This is a much more 'normal' result for a parser shootout. PurePerl is the clear loser and expat is the clear winner. LibXML is looking pretty good, but that parser's advantages don't really shine unless you're building DOM trees - when it moves into a class of its own.
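
For anyone wanting to repeat this kind of comparison, here is a minimal timing sketch. It is not the script used for the tables above. The helper name time_each is invented for the example, each parser is run only once, and it selects the parser through the XML_SIMPLE_PREFERRED_PARSER environment variable described in the XML::Simple docs. A parser that fails to load is reported as 'failed' instead of stopping the run.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run each named code reference once and record wall-clock seconds.
# A candidate that dies (for example, a parser module that is not
# installed) is recorded as undef rather than aborting the whole run.
# (Hypothetical helper, written for this illustration.)
sub time_each {
    my (%candidates) = @_;
    my %seconds;
    for my $name (sort keys %candidates) {
        my $start = time;
        my $ok    = eval { $candidates{$name}->(); 1 };
        $seconds{$name} = $ok ? time - $start : undef;
    }
    return %seconds;
}

# Only load XML::Simple when a file name is given on the command line.
if (my $file = shift @ARGV) {
    require XML::Simple;

    # Parser names are the ones listed in the tables above.
    my %candidates = map {
        my $parser = $_;
        ($parser => sub {
            local $ENV{XML_SIMPLE_PREFERRED_PARSER} = $parser;
            XML::Simple::XMLin($file);
        });
    } qw(XML::Parser XML::SAX::Expat XML::SAX::ExpatXS
         XML::SAX::PurePerl XML::LibXML::SAX::Parser);

    my %seconds = time_each(%candidates);
    for my $parser (sort keys %seconds) {
        my $t = $seconds{$parser};
        printf "%-26s %s\n", $parser,
            defined $t ? sprintf('%.2f', $t) : 'failed';
    }
}
```

A single run per parser is noisy, so treat any numbers it prints as a rough guide only.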

So what's causing your problem? Dunno. You seem to have triggered some pathological corner case in expat. But really, it seems to me like a case of "Doctor it hurts when I do this". Had you considered passing just the filename for the MP3 file?
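
If changing the shape of the payload is an option, the <line> wrapping shown earlier could be produced with something like the sketch below. The subroutine name wrap_data_lines and the shortened sample strings are made up for the example. It assumes that <data> and </data> each sit on a line of their own and that the payload is Base64-style text with nothing that needs XML escaping.

```perl
use strict;
use warnings;

# Wrap every non-empty line found between <data> and </data> in its own
# <line> element. Lines outside <data> are passed through unchanged.
# (Hypothetical helper, written for this illustration.)
sub wrap_data_lines {
    my ($xml) = @_;
    my $in_data = 0;
    my @out;
    for my $line (split /\n/, $xml) {
        if    ($line =~ m{^\s*<data>\s*$})  { $in_data = 1; push @out, $line }
        elsif ($line =~ m{^\s*</data>\s*$}) { $in_data = 0; push @out, $line }
        elsif ($in_data && $line =~ /\S/)   { push @out, "<line>$line</line>" }
        else                                { push @out, $line }
    }
    return join("\n", @out) . "\n";
}

# Shortened stand-in for the real payload.
my $sample = "<data>\nSUQzAgAA\ncwBUQUwA\n</data>\n";
print wrap_data_lines($sample);
```

The receiving side would then have to join the contents of the <line> elements back together before decoding.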

Update: Yeah thanks dbecoll - I obviously had my brain disengaged when I typed that.


Re^2: XML::Simple hangs

by dbecoll (Initiate) on Jun 05, 2005 at 07:01 UTC

> As you can see, the run times for all the expat based
> parsers are increasing exponentially with the size of the
> file. The PurePerl parser is coping amazingly well and its 
> run times are only increasing geometrically.

I hope you meant linearly rather than geometrically. ;)


Re^2: XML::Simple hangs

by Ovid (Cardinal) on Jun 05, 2005 at 14:22 UTC

Thank you very much. Regrettably, just passing the filename isn't an option in this case. This XML is being passed via SOAP and, as a result, the SOAP server will sometimes be remote and not always have access to the mp3. However, you've definitely given me some great information.

Cheers,
Ovid

New address of my CGI Course.


Re: XML::Simple hangs
by mirod (Canon) on Jun 04, 2005 at 09:30 UTC

Which parser is XML::Simple using? Did you try using one of the SAX parsers (not XML::SAX::PurePerl, but maybe the XML::LibXML one)?


Re^2: XML::Simple hangs

by Ovid (Cardinal) on Jun 04, 2005 at 20:37 UTC

I didn't know there was an option. In reading through the docs, I'm not sure how to switch it to a SAX parser.

Cheers,
Ovid

New address of my CGI Course.


Re^3: XML::Simple hangs

by bmann (Priest) on Jun 04, 2005 at 22:03 UTC

perldoc XML::Simple

You can dictate which parser module is used by setting either the
environment variable 'XML_SIMPLE_PREFERRED_PARSER' or the package variable
$XML::Simple::PREFERRED_PARSER to contain the module name.

$XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX'; should do the trick.

Update: Results after changing parser (Debian unstable, P4 2.8G, 1G RAM):

#!/usr/bin/perl
use warnings;
use strict;
use XML::Simple qw/XMLin XMLout/;

my $file = shift || die "Must supply xml file";
open my $fh, '<', $file or die "Cannot open $file for reading: $!";
my $document = do { local $/; <$fh> };    # slurp the whole file
close $fh;
# Ask XML::Simple for the libxml-based SAX parser rather than whatever
# it would pick on its own.
$XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';
my $ref = XMLin( $document );

__END__
Output:
8787248
real    0m1.293s
user    0m0.752s
sys     0m0.526s