wardy3 has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to write an XML parser for a large-ish XML file. It's about 800,000 lines long but can get bigger (or smaller). The format of the file looks like:

    <?xml version="1.0" standalone="yes" ?>
    <abc>
      <def>
        <ghi>
          <jkl>information</jkl>
        </ghi>
        <important>
          <data>
            <info1>info 1</info1>
            <info2>info 2</info2>
            <info4>
              <data>data 4</data>
            </info4>
          </data>
          <data>
            <info1>info</info1>
            <info2>info</info2>
          </data>
        </important>
      </def>
    </abc>

Note that the tag "data" appears twice, each time at a different path.
I have used XML::Parser before for parsing a similar file but thought I'd explore the world of the latest parsers. So far, I have tried XML::SAX (using XML::SAX::ExpatXS), XML::Twig and - just now - XML::Rules. I believe I need a stream parser because of the size, even though I need the path sometimes.
All these parsers seem significantly slower than the old-fashioned XML::Parser. I was a bit surprised at this given the age of it. My sample scripts are below, with timings. I'm running on cygwin using a Dell Latitude D610.
Should I move with the times, do you think?
Thanks, Michael
Here are some of the scripts I've been using to benchmark the parsers. Please note that I haven't put all the complications in here; I'm just trying to get a baseline for how fast each parser can get through the file.
If I've missed something or done something a weird way, I'd appreciate the feedback. I'm tracking the element path as I need to use it in the real version.
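To illustrate what I mean by tracking the path, here is a minimal sketch using XML::Parser on a cut-down copy of the sample above. The `%text_at` hash and the slash-joined path format are just for illustration and are not part of the real scripts. It shows how the two "data" elements can be told apart by their full path.

```perl
use strict;
use warnings;
use XML::Parser;

# Cut-down copy of the sample document from the question.
my $xml = <<'XML';
<?xml version="1.0" standalone="yes" ?>
<abc><def><important>
  <data>
    <info1>info 1</info1>
    <info4><data>data 4</data></info4>
  </data>
</important></def></abc>
XML

my @element_stack;    # names of the currently open elements, outermost first
my %text_at;          # accumulated character data, keyed by full element path

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ( $expat, $name ) = @_; push @element_stack, $name; },
        End   => sub { pop @element_stack; },
        Char  => sub {
            my ( $expat, $string ) = @_;
            # Char may be called more than once per element, so append.
            my $path = '/' . join( '/', @element_stack );
            $text_at{$path} .= $string;
        },
    },
);
$parser->parse($xml);

# Report only the paths that collected something other than whitespace.
for my $path ( sort keys %text_at ) {
    next unless $text_at{$path} =~ /\S/;
    print "$path => $text_at{$path}\n";
}
```

The outer "data" only ever collects whitespace here, while the inner one ends up under `/abc/def/important/data/info4/data`. XML::Parser's expat object also offers a `context` method that returns the list of open elements, which may remove the need for a hand-maintained stack.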
SAX

Takes about 1 minute

    use strict;
    use XML::SAX;

    my $parser = XML::SAX::ParserFactory->parser(
        Handler => symdevHandler->new
    );

    open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
    $parser->parse_file($symdev_list);
    close $symdev_list;

    package symdevHandler;
    use base qw(XML::SAX::Base);

    my ( $dev );
    my @element_stack;    # names of the currently open elements
    my %conf_of;          # configuration text keyed by device name

    sub start_element {
        my ($self, $el) = @_;
        push @element_stack, $el->{Name};
    }

    sub end_element {
        my ($self, $el) = @_;
        pop @element_stack;
    }

    sub characters {
        my ($self, $el) = @_;
        my $text = $el->{Data};
        my $in_element = $element_stack[-1];
        if ( $in_element eq 'dev_name' ) {
            $dev = $text;
        }
        elsif ( $in_element eq 'configuration' ) {
            $conf_of{$dev} = $text;
        }
    }

    sub end_document {
        my ($self, $el) = @_;
        for my $key ( keys %conf_of ) {
            print "$key\t$conf_of{$key}\n";
        }
    }

    1;

XML::Parser

Takes about 40 seconds

    use strict;
    use XML::Parser;

    # Set up the XML parser to point to standard symdev processing subroutines
    my $parser = XML::Parser->new(
        Handlers => {
            Start => \&symdev_start,
            End   => \&symdev_end,
            Final => \&symdev_final,
            Char  => \&symdev_char,
        }
    );

    open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
    $parser->parse($symdev_list);
    close $symdev_list;

    {
        my ( $dev, $text );
        my @element_stack;    # names of the currently open elements
        my %conf_of;          # configuration text keyed by device name

        sub symdev_start {
            my ( $expat, $name, %atts ) = @_;
            push @element_stack, $name;
            $text = '';
        }

        # End handlers receive only the expat object and the element name
        sub symdev_end {
            my ( $expat, $name ) = @_;
            pop @element_stack;
            if ( $name eq 'dev_name' ) {
                $dev = $text;
            }
            elsif ( $name eq 'configuration' ) {
                $conf_of{$dev} = $text;
            }
        }

        sub symdev_char {
            my ( $expat, $string ) = @_;
            $text .= $string;
        }

        # The Final handler receives only the expat object
        sub symdev_final {
            my ($expat) = @_;
            for my $key ( keys %conf_of ) {
                print "$key\t$conf_of{$key}\n";
            }
        }
    }

XML::Rules

Takes about 90 seconds but isn't doing much work

    use strict;
    use warnings;
    use XML::Rules;

    open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";

    my @rules = (
        dev_name => sub { my $dev = $_[1]->{_content} },
    );

    my $parser = XML::Rules->new( rules => \@rules );
    $parser->parse($symdev_list);
    close $symdev_list;

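For anyone wanting to take comparable wall-clock timings, here is a rough sketch of one way to do it. It assumes the core Time::HiRes module, and `time_it` is a hypothetical helper, so it is not necessarily how the figures above were measured. The small loop is only a placeholder for the real parse call.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference once and return the
# elapsed wall-clock time in seconds.
sub time_it {
    my ($code) = @_;
    my $start = [gettimeofday];
    $code->();
    return tv_interval($start);    # interval from $start to now
}

# Placeholder workload; in practice this would wrap the parse call.
my $elapsed = time_it( sub { my $n = 0; $n += $_ for 1 .. 100_000; } );
printf "Took %.3f seconds\n", $elapsed;
```

Wrapping only the parse call keeps module load time out of the measurement, which matters when comparing parsers with different start-up costs.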
Replies are listed 'Best First'.

Re: Which XML parser would be the wisest to use
by runrig (Abbot) on Feb 21, 2008 at 01:12 UTC
by wardy3 (Scribe) on Feb 21, 2008 at 01:32 UTC
by runrig (Abbot) on Feb 21, 2008 at 04:28 UTC
by Jenda (Abbot) on Feb 29, 2008 at 00:30 UTC

Re: Which XML parser would be the wisest to use
by mirod (Canon) on Feb 21, 2008 at 05:45 UTC
by wardy3 (Scribe) on Feb 21, 2008 at 06:05 UTC
by mirod (Canon) on Feb 21, 2008 at 07:49 UTC
by wardy3 (Scribe) on Feb 28, 2008 at 07:47 UTC

Re: Which XML parser would be the wisest to use
by Skeeve (Parson) on Feb 21, 2008 at 10:35 UTC
by wardy3 (Scribe) on Feb 28, 2008 at 07:50 UTC