amazotron has asked for the wisdom of the Perl Monks concerning the following question:

I have lines of data to parse which look like the line below.

<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>

It is XML, but a severe subset:

- the XML element will always be on a single line
- names and values will vary
- number of attributes will vary
- no child elements to deal with

Originally this data was in CSV format

abcd,a1a1a1,b1b1b1b1,c1c1

I was using split originally on the CSV line so that's the timing that I'm comparing against.
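For context, the split baseline being compared against is essentially this (a minimal sketch of the obvious comma split, not the poster's exact code):

```perl
use strict;
use warnings;

# Split the original CSV form on commas -- the timing baseline.
my $csv_line = 'abcd,a1a1a1,b1b1b1b1,c1c1';
my @fields   = split /,/, $csv_line;

print join("|", @fields), "\n";   # abcd|a1a1a1|b1b1b1b1|c1c1
```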

The question: what is the fastest way to parse the pidgin XML line?

I am assuming that because it's a restricted form, we can gain speed with a hand-coded solution compared to using a real XML parser.

I'd like the values to be parsed as

$1 = abcd
$2 = a
$3 = a1a1a1
$4 = bb
$5 = b1b1b1b1
$6 = cc
$7 = c1c1

I have used the regex below, but it's still almost a factor of 2 slower than using split on the CSV version:
use strict;
use Benchmark;

my $testline  = '<someelement a="123" bbb="rrr sss ttt" cccc="14 or 15">';
my $xmlregex  = qr/(?:\s+(\w+)=\"(.*?)\")|(?:^\s*<(\w+))|(?:>\s*$)/;
my $xmlregex1 = qr/^\s*<([^\s]+)(.*)>\s*$/;
my $xmlregex2 = qr/\s+(\w+)="([^"]+)"/;
my @lines     = ($testline) x 5000;
my @example   = ();

sub useregex2 {
    my @items = ();
    foreach my $line (@lines) {
        if ($line =~ /$xmlregex1/o) {
            my ($element, $attribs) = ($1, $2);
            @items = grep length, split(/$xmlregex2/, $attribs);
        }
        else {
            print "useregex2: malformed XML in $line\n";
            exit;
        }
    }
}

sub usesplit {
    foreach my $line (@lines) {
        my @items = split(/\s+/, $line);
    }
}

timethese(500, {
    'useregex2' => \&useregex2,
    # 'useregex' => \&useregex,
    'usesplit'  => \&usesplit,
});
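One alternative worth adding to the benchmark (a sketch, not a tested drop-in for the code above): grab the element name once, then let a single global match walk the attribute list, which avoids the split-plus-grep pass and produces exactly the flat list of values described earlier:

```perl
use strict;
use warnings;

# Sketch: one anchored match for the element name, then a single
# global match to collect (name, value) attribute pairs in order.
sub parse_line {
    my ($line) = @_;
    my ($element) = $line =~ /^\s*<(\w+)/ or return;
    my @items = ($element);
    while ($line =~ /(\w+)="([^"]*)"/g) {
        push @items, $1, $2;
    }
    return @items;
}

my @parsed = parse_line('<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>');
print join(",", @parsed), "\n";   # abcd,a,a1a1a1,bb,b1b1b1b1,cc,c1c1
```

Whether this beats the two-regex version is a question for Benchmark; it depends on the attribute count per line.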

Re: quickest way to parse pidgin XML?
by CountZero (Bishop) on May 28, 2004 at 16:37 UTC
    Split seems entirely the better solution here.

    That being said, a good rule is that anything XML should be parsed, not regexed. You always start with something simple, add a few things left and right, and before you know it the format gets so complex that you are re-inventing an XML parser in order to read it.

    Better to use the original CSV-file and let it be handled by Text::CSV (which will nicely handle any escapes for "forbidden" characters).
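    A minimal sketch of that Text::CSV route (assuming the module is installed from CPAN; the sample line with an embedded comma is an illustration, not data from the thread):

```perl
use strict;
use warnings;
use Text::CSV;   # not core Perl; install from CPAN

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot construct Text::CSV: " . Text::CSV->error_diag;

# A quoted field containing a comma -- something a bare split /,/
# would get wrong.
my $line = 'abcd,a1a1a1,"b1,b1b1b1",c1c1';

$csv->parse($line) or die "parse failed";
my @fields = $csv->fields;

print join("|", @fields), "\n";   # abcd|a1a1a1|b1,b1b1b1|c1c1
```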

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: quickest way to parse pidgin XML?
by samtregar (Abbot) on May 28, 2004 at 18:31 UTC
    Originally this data was in CSV format

    Why did it change? If you have a flat data-structure then CSV is obviously a better choice than XML. It's more compact, probably faster to parse and probably faster to generate. I'd say go back to CSV and check out the excellent Text::CSV_XS.

    -sam

      I appreciate all the sentiments. The large data sets are still CSV-based, but I now need to store more information, so the data is no longer flat. XML seems like an ideal mechanism (or possibly a set of database tables), but I'm very concerned about speed. The current implementation is not the speediest, and any real slowdown is going to be noticed. I control both the reading and writing of the files, and I thought it would be ideal to use a subset of XML (for speed). Allan
        You need to drop XML like a hot rock if speed is a primary concern and you control both sides of the transaction. XML isn't optimized for speed, it's optimized for readability and extensibility. Even if you cheat by writing your own regexes you'll still have to contend with the overhead added by all the "<foo></foo>" repetition.

        Have you considered Storable? As far as serializing Perl data-structures goes it's the undisputed speed king. Depending on your access pattern you might also consider DB_File. When used correctly it can be quite fast.
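        For reference, Storable is core Perl; a minimal sketch of round-tripping a non-flat record with freeze/thaw (the hash layout here is just an illustration, not anything prescribed by the thread):

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);   # core module, C-backed serialization

my $record = {
    element => 'abcd',
    attrs   => { a => 'a1a1a1', bb => 'b1b1b1b1', cc => 'c1c1' },
};

my $frozen = freeze($record);   # compact binary string
my $copy   = thaw($frozen);     # deep copy back to a Perl structure

print $copy->{attrs}{bb}, "\n";   # b1b1b1b1
```

        For on-disk use, Storable's store/retrieve pair does the same thing against a file instead of a scalar.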

        -sam