I have lines of data to parse that look like the line below.
<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>
It is XML, but a severe subset:
- the XML element will always be on a single line
- names and values will vary
- number of attributes will vary
- no child elements to deal with
Originally this data was in CSV format:
abcd,a1a1a1,b1b1b1b1,c1c1
I was originally using split on the CSV line, so that is the timing I'm comparing against.
The question: what is the fastest way to parse the pidgin XML line?
I am assuming that because it's a restricted form, we can gain speed with a hand-coded solution compared to using a real XML parser.
I'd like the values to be parsed as:
$1 = abcd
$2 = a
$3 = a1a1a1
$4 = bb
$5 = b1b1b1b1
$6 = cc
$7 = c1c1
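For illustration, here is a minimal sketch of one way to get that flat list: peel off the element name with an anchored match, then pull every name="value" pair out with a single list-context global match. The sub name parse_line is made up for this example, and it assumes values never contain a double quote; I have not timed it against the split version.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (element, name1, value1, name2, value2, ...)
# or an empty list if the line does not look like a single XML element.
# Assumes attribute values never contain a double quote.
sub parse_line {
    my ($line) = @_;

    # Element name, then everything up to the closing "/>" or ">".
    my ($element, $attribs) = $line =~ m{^\s*<(\w+)(.*?)/?>\s*$}
        or return;

    # In list context, //g returns all captures from all matches.
    my @pairs = $attribs =~ /(\w+)="([^"]*)"/g;

    return ($element, @pairs);
}

print join('|', parse_line('<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>')), "\n";
# prints: abcd|a|a1a1a1|bb|b1b1b1b1|cc|c1c1
```

The returned list lines up with the $1 .. $7 layout above: the element name first, then alternating attribute names and values.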
I have used the regex below, but it's still almost a factor of 2 slower than using split on the CSV version:
use strict;
use Benchmark;

my $testline = '<someelement a="123" bbb="rrr sss ttt" cccc="14 or 15">';

# Single-regex attempt (only used by the commented-out 'useregex' test).
my $xmlregex  = qr/(?:\s+(\w+)=\"(.*?)\")|(?:^\s*<(\w+))|(?:>\s*$)/;
# Step 1: capture the element name and the raw attribute string.
my $xmlregex1 = qr/^\s*<([^\s]+)(.*)>\s*$/;
# Step 2: used as a split pattern; the captures return names and values.
my $xmlregex2 = qr/\s+(\w+)="([^"]+)"/;

my @lines = ($testline) x 5000;
my @example = ();

sub useregex2
{
    my @items = ();
    foreach my $line (@lines)
    {
        if ($line =~ /$xmlregex1/o)
        {
            my ($element,$attribs) = ($1,$2);
            # split leaves empty fields between the captures; grep drops them.
            @items = grep length, split(/$xmlregex2/, $attribs);
        }
        else
        {
            print "useregex2: malformed XML in $line\n";
            exit;
        }
    }
}

sub usesplit
{
    foreach my $line (@lines)
    {
        my @items = split(/\s+/,$line);
    }
}

timethese (500, {
    'useregex2' => \&useregex2,
    # 'useregex' => \&useregex,
    'usesplit' => \&usesplit,
});
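Note that usesplit above splits the XML test line on whitespace rather than splitting an actual CSV line. As a rough sketch of a like-for-like comparison, the harness below times a comma split on a CSV record against an anchored match plus one list-context //g match on the XML record. The sample CSV line, the sub names, and the iteration counts are all assumptions made up for this example; cmpthese comes from the standard Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample record in both formats (not real data).
my $xml_line  = '<someelement a="123" bbb="rrr sss ttt" cccc="14 or 15"/>';
my $csv_line  = 'someelement,123,rrr sss ttt,14 or 15';
my @xml_lines = ($xml_line) x 1000;
my @csv_lines = ($csv_line) x 1000;

# Baseline: split the CSV form on commas.
sub parse_csv {
    my ($line) = @_;
    return split /,/, $line;
}

# Candidate: element name from an anchored match, then every
# name="value" pair from a single list-context //g match.
# Assumes attribute values never contain a double quote.
sub parse_xml {
    my ($line) = @_;
    my ($name) = $line =~ /^\s*<(\w+)/ or return;
    return ($name, $line =~ /\s(\w+)="([^"]*)"/g);
}

# Each runner parses every line once and returns the last field count.
sub run_csv { my @items; @items = parse_csv($_) for @csv_lines; return scalar @items }
sub run_xml { my @items; @items = parse_xml($_) for @xml_lines; return scalar @items }

# Run each sub 500 times and print a comparison table.
cmpthese(500, {
    csv_split => \&run_csv,
    xml_match => \&run_xml,
});
```

Both runners pay the same per-line sub call overhead, so the ratio between them is more meaningful than the absolute rates. The XML parser also returns attribute names, so it does more work per line than the CSV split by design.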