Hi Monks. I hope someone can point me in the right direction, please.

I am trying to write an XML parser for a large-ish XML file. It's about 800,000 lines long but can get bigger (or smaller). The format of the file looks like:

<?xml version="1.0" standalone="yes" ?>
<abc>
  <def>
    <ghi>
      <jkl>information</jkl>
    </ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4>
          <data>data 4</data>
        </info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
  </def>
</abc>

Note the duplicate tag "data", which appears at two different paths.

I have used XML::Parser before for parsing a similar file but thought I'd explore the world of the latest parsers. So far, I have tried XML::SAX (using XML::SAX::ExpatXS), XML::Twig and - just now - XML::Rules. I believe I need a stream parser because of the size, even though I need the path sometimes.

All these parsers seem significantly slower than the old-fashioned XML::Parser. I was a bit surprised at this given the age of it. My sample scripts are below, with timings. I'm running on cygwin using a Dell Latitude D610.

Should I move with the times, do you think?

Thanks, Michael

Here are some of the scripts I've been using to benchmark the parsers. Please note that I haven't put all the complications in here; I'm just trying to get a base idea of how fast they can get through the file.

If I've missed something or done something a weird way, I'd appreciate the feedback. I'm tracking the element path as I need to use it in the real version.
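To show why the path matters for the duplicate "data" tag, here is a minimal sketch using XML::Parser on a cut-down copy of the sample format. The %text_at hash and the inline sample string are only for illustration and are not part of my real scripts. It also assumes no mixed content, since the text buffer is reset at every start tag.

```perl
use strict;
use warnings;
use XML::Parser;

# Cut-down sample with the same shape as the format shown above (illustrative only).
my $xml = <<'XML';
<?xml version="1.0" standalone="yes" ?>
<abc><def><important>
  <data><info1>info 1</info1><info4><data>data 4</data></info4></data>
  <data><info1>info</info1></data>
</important></def></abc>
XML

my @stack;      # names of the currently open elements, outermost first
my %text_at;    # hypothetical store: full path => list of text values seen there
my $text = '';  # character data collected for the current element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ( $expat, $name ) = @_;
            push @stack, $name;
            $text = '';
        },
        # Expat may deliver text in several chunks, so append rather than assign.
        Char => sub {
            my ( $expat, $string ) = @_;
            $text .= $string;
        },
        End => sub {
            my ( $expat, $name ) = @_;
            my $path = '/' . join '/', @stack;    # path including this element
            push @{ $text_at{$path} }, $text if $text =~ /\S/;
            pop @stack;
            $text = '';
        },
    },
);

$parser->parse($xml);

# The inner "data" is stored under a different key from anything
# directly below /abc/def/important/data.
for my $path ( sort keys %text_at ) {
    print "$path\t@{ $text_at{$path} }\n";
}
```

Keying on the full path keeps the inner "data" (under info4) separate from the outer "data" records, at the cost of one join per end tag.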

SAX

Takes about 1 minute

use strict;
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser( Handler => symdevHandler->new );

# Three-argument open with an error check
open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse_file($symdev_list);
close $symdev_list;

package symdevHandler;
use base qw(XML::SAX::Base);

my ( $dev );
my @element_stack;    # names of the currently open elements
my %conf_of;          # dev_name => configuration

sub start_element {
    my ($self, $el) = @_;
    push @element_stack, $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
    pop @element_stack;
}

# Note: SAX may deliver an element's text in more than one characters() call
sub characters {
    my ($self, $el) = @_;
    my $text = $el->{Data};
    my $in_element = $element_stack[-1];
    if ( $in_element eq 'dev_name' ) {
        $dev = $text;
    }
    elsif ( $in_element eq 'configuration' ) {
        $conf_of{$dev} = $text;
    }
}

sub end_document {
    my ($self, $el) = @_;
    for my $key ( keys %conf_of ) {
        print "$key\t$conf_of{$key}\n";
    }
}

1;

XML::Parser

Takes about 40 seconds

use strict;
use XML::Parser;

# Set up the XML parser to point to standard symdev processing subroutines
my $parser = XML::Parser->new(
    Handlers => {
        Start => \&symdev_start,
        End   => \&symdev_end,
        Final => \&symdev_final,
        Char  => \&symdev_char,
    }
);

# Three-argument open with an error check
open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse($symdev_list);
close $symdev_list;

{
    my ( $dev, $text );
    my @element_stack;    # names of the currently open elements
    my %conf_of;          # dev_name => configuration

    sub symdev_start {
        my ( $expat, $name, %atts ) = @_;
        push @element_stack, $name;
        $text = '';
    }

    # End handlers receive only the parser object and the element name
    sub symdev_end {
        my ( $expat, $name ) = @_;
        pop @element_stack;
        if ( $name eq 'dev_name' ) {
            $dev = $text;
        }
        elsif ( $name eq 'configuration' ) {
            $conf_of{$dev} = $text;
        }
    }

    # Text can arrive in several pieces, so append it
    sub symdev_char {
        my ( $expat, $string ) = @_;
        $text .= $string;
    }

    # The Final handler receives only the parser object
    sub symdev_final {
        my ($expat) = @_;
        for my $key ( keys %conf_of ) {
            print "$key\t$conf_of{$key}\n";
        }
    }
}

XML::Rules

Takes about 90 seconds but isn't doing much work

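use strict;
use warnings;
use XML::Rules;

# Three-argument open with an error check
open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";

# The only rule just reads the content of each dev_name element
my @rules = (
    dev_name => sub { my $dev = $_[1]->{_content} },
);

my $parser = XML::Rules->new( rules => \@rules );
$parser->parse($symdev_list);
close $symdev_list;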

In reply to Which XML parser would be the wisest to use by wardy3
