I have lines of data to parse which look like the line below.

<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>

It is XML, but a severe subset:

- the XML element will always be on a single line
- names and values will vary
- number of attributes will vary
- no child elements to deal with

Originally this data was in CSV format:

abcd,a1a1a1,b1b1b1b1,c1c1

I was using split originally on the CSV line so that's the timing that I'm comparing against.
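For reference, the CSV baseline amounts to something like the sketch below. It assumes a plain split on commas with no quoted or escaped fields; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Baseline: one split per line, no quoting or escaping handled.
my $csvline = 'abcd,a1a1a1,b1b1b1b1,c1c1';
my @items   = split /,/, $csvline;

print scalar(@items), " fields: @items\n";
```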

The question: what is the fastest way to parse the pidgin XML line?

I am assuming that because it's a restricted form, a hand-coded solution can be faster than a real XML parser.

I'd like the values to be parsed as

$1 = abcd
$2 = a
$3 = a1a1a1
$4 = bb
$5 = b1b1b1b1
$6 = cc
$7 = c1c1
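To make the target concrete, here is a minimal sketch of one way to produce that flat list: match the element name once, then pull the name/value pairs with a global match in list context. It assumes the restrictions listed above, plus double-quoted values with no embedded double quotes. The sub name parse_line is made up for illustration, and I have not timed this against split.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (element, name1, value1, name2, value2, ...)
# or an empty list if the line does not look like <name attr="value" ... />.
# Assumes values are double-quoted and contain no embedded double quotes.
sub parse_line {
    my ($line) = @_;

    # Element name, then everything up to the closing > or />.
    my ($element, $attribs) = $line =~ m{^\s*<(\w+)(.*?)\s*/?>\s*$}
        or return;

    # In list context, m//g returns every captured group from every match.
    my @pairs = $attribs =~ /\s+(\w+)="([^"]*)"/g;

    return ($element, @pairs);
}

my @fields = parse_line('<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>');
print join(',', @fields), "\n";
```

The list-context match avoids the grep over split results, since it never produces the empty separator fields in the first place.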

I have used the regex below, but it's still almost a factor of 2 slower than using split on the CSV version:

use strict;
use Benchmark;

my $testline = '<someelement a="123" bbb="rrr sss ttt" cccc="14 or 15">';
my $xmlregex = qr/(?:\s+(\w+)=\"(.*?)\")|(?:^\s*<(\w+))|(?:>\s*$)/;

my $xmlregex1 = qr/^\s*<([^\s]+)(.*)>\s*$/;
my $xmlregex2 = qr/\s+(\w+)="([^"]+)"/;

my @lines = ($testline) x 5000;

my @example = ();

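sub useregex2
{
    my @items = ();

    foreach my $line (@lines)
    {
        # First pass: element name in $1, raw attribute string in $2.
        if ($line =~ /$xmlregex1/o)
        {
            my ($element,$attribs) = ($1,$2);
            # Second pass: split on each name="value" pair, keeping the captures.
            @items = grep length, split(/$xmlregex2/, $attribs);
        }
        else
        {
            print "useregex2: malformed XML in $line\n";
            exit;
        }
    }
}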

sub usesplit
{
    foreach my $line (@lines)
    {
        my @items = split(/\s+/,$line);
    }
}

timethese (500, {
    'useregex2' => \&useregex2,
#    'useregex' => \&useregex,
    'usesplit' => \&usesplit,
});

In reply to quickest way to parse pidgeon XML? by amazotron
