Hi Monks. I hope someone can point me in the right direction, please.

I am trying to write an XML parser for a large-ish XML file. It's about 800,000 lines long but can get bigger (or smaller). The format of the file looks like:

<?xml version="1.0" standalone="yes" ?>

<abc>
  <def>
    <ghi>
      <jkl>information</jkl>
    </ghi>
    <important>
      <data>
        <info1>info 1</info1>
        <info2>info 2</info2>
        <info4>
          <data>data 4</data>
        </info4>
      </data>
      <data>
        <info1>info</info1>
        <info2>info</info2>
      </data>
    </important>
  </def>
</abc>

Note that the tag "data" appears at two different paths: directly under "important", and again nested inside "info4".
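To make the path issue concrete, here is a minimal sketch (not my real script) that uses XML::Parser with a hand-maintained element stack to tell the two "data" elements apart. The trimmed-down inline XML and the variable names `@path` and `@seen` are just for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

# Cut-down version of the sample document, kept inline for illustration.
my $xml = <<'XML';
<abc><def><important>
  <data><info4><data>data 4</data></info4></data>
</important></def></abc>
XML

my @path;    # names of the currently open elements, outermost first
my @seen;    # full path recorded for every "data" start tag

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ( $expat, $name ) = @_;
            push @path, $name;
            push @seen, join( '/', @path ) if $name eq 'data';
        },
        End => sub { pop @path },
    },
);
$parser->parse($xml);

print "$_\n" for @seen;
```

The same tag name shows up with two distinct paths, so a handler can compare the joined path rather than just the element name.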

I have used XML::Parser before for parsing a similar file but thought I'd explore the world of the latest parsers. So far, I have tried XML::SAX (using XML::SAX::ExpatXS), XML::Twig and - just now - XML::Rules. I believe I need a stream parser because of the size, even though I need the path sometimes.

All these parsers seem significantly slower than the old-fashioned XML::Parser, which surprised me a bit given its age. My sample scripts are below, with timings. I'm running on Cygwin on a Dell Latitude D610.

Should I move with the times, do you think?

Thanks, Michael

Here are some of the scripts I've been using to benchmark the parsers. Please note that I haven't put all the complications in here; I'm just trying to get a baseline idea of how fast each one can get through the file.

If I've missed something or done something a weird way, I'd appreciate the feedback. I'm tracking the element path as I need to use it in the real version.
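For anyone wanting to repeat the comparison, this is a rough sketch of one way to time a parse with Time::HiRes. The helper name `time_it` and the generated in-memory document are assumptions for illustration only; the timings quoted in this post came from running each script against the real file.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use XML::Parser;

# Hypothetical helper: run a code ref once and return elapsed wall-clock seconds.
sub time_it {
    my ($code) = @_;
    my $start = time;
    $code->();
    return time - $start;
}

# Small generated document standing in for the real file (an assumption).
my $xml = '<abc>' . ( '<data><info1>x</info1></data>' x 1000 ) . '</abc>';

# Parse with no handlers at all to get a baseline for the parser itself.
my $elapsed = time_it( sub { XML::Parser->new->parse($xml) } );

printf "XML::Parser baseline: %.3f s\n", $elapsed;
```

Timing a handler-free run first gives a floor for each parser, so the cost of the callbacks can be judged separately.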

SAX

Takes about 1 minute

use strict;
use XML::SAX;

my $parser = XML::SAX::ParserFactory->parser(
    Handler => symdevHandler->new
);

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse_file($symdev_list);
close $symdev_list;

package symdevHandler;
use base qw(XML::SAX::Base);

my ( $dev );
my @element_stack;
my %conf_of;

sub start_element {
    my ($self, $el) = @_;

    push @element_stack, $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;

    pop @element_stack;
}

sub characters {
    my ($self, $el) = @_;

    my $text = $el->{Data};

    my $in_element = $element_stack[-1];

    if ( $in_element eq 'dev_name' ) {
        $dev = $text;
    }
    elsif ( $in_element eq 'configuration' ) {
        $conf_of{$dev} = $text;
    }

}

sub end_document {
    my ($self, $el) = @_;

    for my $key ( keys %conf_of ) {
        print "$key\t$conf_of{$key}\n";
    }
}

1;

XML::Parser

Takes about 40 seconds

use strict;
use XML::Parser;
# Set up the XML parser to point to standard symdev processing subroutines
my $parser = XML::Parser->new(
    Handlers => {
        Start => \&symdev_start,
        End   => \&symdev_end,
        Final => \&symdev_final,
        Char  => \&symdev_char,
    }
);

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";
$parser->parse($symdev_list);
close $symdev_list;

{
    my ( $dev, $text );
    my @element_stack;
    my %conf_of;

    sub symdev_start {
        my ( $expat, $name, %atts ) = @_;

        push @element_stack, $name;

        $text = '';
    }
    sub symdev_end {
        my ( $expat, $name ) = @_;    # End handlers receive no attributes

        pop @element_stack;

        if ( $name eq 'dev_name' ) {
            $dev = $text;
        }
        elsif ( $name eq 'configuration' ) {
            $conf_of{$dev} = $text;
        }
    }

    sub symdev_char {
        my ( $expat, $string ) = @_;

        $text .= $string;
    }

    sub symdev_final {
        my ($expat) = @_;    # Final is passed only the Expat object

        for my $key ( keys %conf_of ) {
            print "$key\t$conf_of{$key}\n";
        }
    }
}

XML::Rules

Takes about 90 seconds but isn't doing much work

use strict;
use warnings;

use XML::Rules;

open my $symdev_list, '<', '935.xml' or die "Cannot open 935.xml: $!";

my @rules = (
    dev_name => sub{ my $dev = $_[1]->{_content} },
);


my $parser = XML::Rules->new(rules => \@rules);
$parser->parse( $symdev_list);

close $symdev_list;

In reply to Which XML parser would be the wisest to use by wardy3
