Ovid has asked for the wisdom of the Perl Monks concerning the following question:

In trying to parse some very simple, but large XML files, I've found that XML::Simple hangs. The following example demonstrates the problem:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use XML::Simple qw/XMLin/;

    my $file = shift || die "Must supply xml file";
    open FH, '<', $file or die "Cannot open $file for reading: $!";
    my $document = do { local $/; <FH> };
    XMLin($document);

You can download a compressed (6.2 Meg) version of the XML that causes this to choke. The XML itself is rather simple; it's a huge base64-encoded mp3 that is causing the problems. As near as I can tell, the actual problem is an infinite loop in XML::Parser::Style::Tree.

XML::Simple is the latest version and expat is 1.95.5.

Anyone seen this problem before? (Wrapping the base64 encoded data in a CDATA section has no effect).

Cheers,
Ovid

New address of my CGI Course.

Replies are listed 'Best First'.
Re: XML::Simple hangs
by mifflin (Curate) on Jun 04, 2005 at 06:50 UTC
    I get the same XML::Simple behavior as you. I got tired of waiting for the program to end and broke out of it. That put my machine into a weird state and I was forced to reboot. (damn win2k).
    I then wrote a little XML::SAX routine to see what was happening (see below). Don't XML::Simple and XML::SAX both use XML::Parser under the hood? Anyway, it appears that once it gets to the data element, the characters sub gets called once for each line of data (76 chars) and then once again for each newline. This continues for some time, but it does finish, with the data element containing 8786162 chars.
    use strict;
    use warnings;
    use XML::SAX;

    my $h = Handler->new();
    my $p = XML::SAX::ParserFactory->parser(Handler => $h)
        or die "Unable to get XML SAX parser object";
    $p->parse_uri("10.xml");

    BEGIN {
        package Handler;

        sub new {
            my $class = shift;
            bless({chars => ''}, ref $class || $class);
        }

        sub start_element {
            my ($this, $data) = @_;
            $this->{chars} = '';
            print "start $data->{Name}\n";
        }

        sub end_element {
            my ($this, $data) = @_;
            print "end $data->{Name} = ", length($this->{chars}), "\n";
            $this->{chars} = '';
        }

        sub characters {
            my ($this, $data) = @_;
            my $chars = $data->{Data};
            $this->{chars} .= $chars;
            print "characters ",length($chars),"\n";
        }
    }
    Here's a sample of the output...
Re: XML::Simple hangs
by grantm (Parson) on Jun 05, 2005 at 05:13 UTC

    'Hang' is such an ugly word :-) On my Athlon 2600 laptop it took nearly 2 hours to complete, but it did eventually finish.

    I'm guessing that on your system XML::Simple is using XML::Parser. Internally, XML::Simple does something like this:

    my $xp = XML::Parser->new(Style => 'Tree');
    my $tree = $xp->parse($document);

    and then proceeds to reduce $tree into something simpler.

    If you try that snippet on your document, then I suspect you'll see similarly long processing times. I'm not sure why expat is so slow processing your XML, but it is rather 'unusual' XML.

    Your original document had over 100,000 lines. I benchmarked passing shorter versions of the document through different parser modules using XML::Simple. (If you install XML::SAX, you can choose alternative parsers by assigning a parser module name to $XML::Simple::PREFERRED_PARSER.) I would caution against reading anything at all into these results in a general sense, but in the case of your specific data they are interesting:

    Parser (times in seconds)   5000 lines  10000 lines  15000 lines  20000 lines  25000 lines
    XML::Parser                          8           40           92          169          267
    XML::SAX::Expat                      8           40           95          173          272
    XML::SAX::ExpatXS                    4           27           66          121          191
    XML::SAX::PurePerl                   9           20           29           39           49
    XML::LibXML::SAX::Parser             1            1            1            1            1

    As you can see, the run times for all the expat based parsers are increasing exponentially with the size of the file. The PurePerl parser is coping amazingly well and its run times are only increasing linearly. The libxml based SAX parser is the clear winner in this race with run times so short I can't see an increase at all.
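
A rough way to check how the run times scale is to fit a straight line to log(time) against log(lines): the slope approximates the exponent k in time ~ lines^k. The sketch below is not part of the original benchmark; it only reuses the XML::Parser and XML::SAX::PurePerl figures from the table, and `loglog_slope` is a hypothetical helper name. With these numbers the slope comes out a little above 2 for XML::Parser, which suggests roughly quadratic rather than strictly exponential growth, and close to 1 for PurePerl, which fits the "linearly" reading.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Timings in seconds, copied from the benchmark table in this thread.
my @lines      = (5000, 10000, 15000, 20000, 25000);
my @xml_parser = (8, 40, 92, 169, 267);
my @pure_perl  = (9, 20, 29, 39, 49);

# Least-squares slope of log(time) against log(size).
# A slope near 1 means linear growth, near 2 quadratic, and so on.
sub loglog_slope {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $lx = log($x->[$i]);
        my $ly = log($y->[$i]);
        $sx  += $lx;
        $sy  += $ly;
        $sxx += $lx * $lx;
        $sxy += $lx * $ly;
    }
    return ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
}

printf "XML::Parser        exponent: %.2f\n", loglog_slope(\@lines, \@xml_parser);
printf "XML::SAX::PurePerl exponent: %.2f\n", loglog_slope(\@lines, \@pure_perl);
```

Five data points from one machine are thin evidence, so the fitted exponents are only a hint about the shape of the curve.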

    Just to illustrate how unusual your XML is compared to 'normal' XML, I changed the 20,000 line file so that instead of looking like this...

    <data>
    SUQzAgAAAAAQaVRUMgAAEQBrZWVwIFlhIEhlYWQgVXAAVFAxAAAUAEhhcmxlbSB0aG
    cwBUQUwAABQARnJlZSBBZ2VudC9EZWMuIDA0AFRSSwAABQAzLzMAVEVOAAANAGlUdW
    AENPTQAAaABlbmdpVHVuTk9STQAgMDAwMDA0NUEgMDAwMDAwMDMgMDAwMDMwRjAgMD
    MDAwMENBNTggMDAwMkNDNkQgMDAwMDgyM0MgMDAwMDgwMEQgMDAwNDAxODIgMDAwND
    ...
    </data>

    it looked like this ...

    <data>
    <line>SUQzAgAAAAAQaVRUMgAAEQBrZWVwIFlhIEhlYWQgVXAAVFAxAAAUAEhhcmxlbSB0aG</line>
    <line>cwBUQUwAABQARnJlZSBBZ2VudC9EZWMuIDA0AFRSSwAABQAzLzMAVEVOAAANAGlUdW</line>
    <line>AENPTQAAaABlbmdpVHVuTk9STQAgMDAwMDA0NUEgMDAwMDAwMDMgMDAwMDMwRjAgMD</line>
    <line>MDAwMENBNTggMDAwMkNDNkQgMDAwMDgyM0MgMDAwMDgwMEQgMDAwNDAxODIgMDAwND</line>
    ...
    </data>
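
A minimal sketch of how such a restructured payload could be produced and put back together, assuming the base64 text is already in a string with one encoded line per row. The helper names `wrap_lines` and `unwrap_lines` are hypothetical and not taken from the original post. Base64 uses only letters, digits, `+`, `/` and `=`, so the lines need no XML escaping.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap every non-empty line of a base64 payload in its own <line> element.
sub wrap_lines {
    my ($payload) = @_;
    return join '',
           map  { "<line>$_</line>\n" }
           grep { length }
           split /\r?\n/, $payload;
}

# Reverse step for the receiving side: given the list of <line> contents
# (for example, as returned by the parser), rebuild the base64 text.
sub unwrap_lines {
    my (@lines) = @_;
    return join '', map { "$_\n" } @lines;
}

my $payload = "SUQzAgAAAAAQ\naVRUMgAAEQBr\n";
print "<data>\n", wrap_lines($payload), "</data>\n";
```

The receiver would then pass the rebuilt text to a base64 decoder such as MIME::Base64's `decode_base64`.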

    You might imagine that by introducing all those extra <line> tags and the implicit extra structure, the parsers would have more work to do and would therefore be slower. But here are the timings for processing the modified version:

    Parser (times in seconds)   20000 lines
    XML::Parser                           2
    XML::SAX::Expat                       7
    XML::SAX::ExpatXS                     5
    XML::SAX::PurePerl                   73
    XML::LibXML::SAX::Parser              7

    Yes, you read that right, the run time using XML::Parser dropped from 169 seconds to 2! This is a much more 'normal' result for a parser shootout. PurePerl is the clear loser and expat is the clear winner. LibXML is looking pretty good, but that parser's advantages don't really shine unless you're building DOM trees - when it moves into a class of its own.

    So what's causing your problem? Dunno. You seem to have triggered some pathological corner case in expat. But really, it seems to me like a case of "Doctor it hurts when I do this". Had you considered passing just the filename for the MP3 file?

    Update: Yeah thanks dbecoll - I obviously had my brain disengaged when I typed that.

      > As you can see, the run times for all the expat based
      > parsers are increasing exponentially with the size of the
      > file. The PurePerl parser is coping amazingly well and its 
      > run times are only increasing geometrically.
      
      I hope you meant linearly rather than geometrically. ;)
      

      Thank you very much. Regrettably, just passing the filename isn't an option in this case. This XML is being passed via SOAP and, as a result, the SOAP server will sometimes be remote and not always have access to the mp3. However, you've definitely given me some great information.

      Cheers,
      Ovid


Re: XML::Simple hangs
by mirod (Canon) on Jun 04, 2005 at 09:30 UTC

    Which parser is XML::Simple using? Did you try using one of the SAX parsers (not XML::SAX::PurePerl, but maybe the XML::LibXML one)?

      I didn't know there was an option. In reading through the docs, I'm not sure how to switch it to a SAX parser.

      Cheers,
      Ovid


        From perldoc XML::Simple:

        You can dictate which parser module is used by setting either the environment variable 'XML_SIMPLE_PREFERRED_PARSER' or the package variable $XML::Simple::PREFERRED_PARSER to contain the module name.

        $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX'; should do the trick.

        Update: Results after changing parser (Debian unstable, P4 2.8G, 1G RAM):

        #!/usr/bin/perl
        use warnings;
        use strict;
        use XML::Simple qw/XMLin XMLout/;

        my $file = shift || die "Must supply xml file";
        open FH, '<', $file or die "Cannot open $file for reading: $!";
        my $document = do { local $/; <FH> };
        $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';
        my $ref = XMLin( $document );
        __END__

        Output:
        8787248

        real 0m1.293s
        user 0m0.752s
        sys  0m0.526s