I have lines of data to parse which look like the line below.

<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>

It is XML, but a severe subset:

- the XML element will always be on a single line
- names and values will vary
- number of attributes will vary
- no child elements to deal with

Originally this data was in CSV format:

abcd,a1a1a1,b1b1b1b1,c1c1

I was originally using split on the CSV line, so that is the timing I'm comparing against.
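For reference, the CSV baseline can be sketched as below. This is only an illustration of a plain comma split; the variable names are mine, and it assumes the data never contains quoted or embedded commas.

```perl
use strict;
use warnings;

# Sample line taken from the CSV example above.
my $csvline = 'abcd,a1a1a1,b1b1b1b1,c1c1';

# Plain comma split; assumes no quoted fields and no embedded commas.
my @fields = split /,/, $csvline;

print scalar(@fields), " fields: @fields\n";
```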

The question: what is the fastest way to parse the pidgin XML line?

I am assuming that because it's a restricted form, we can gain speed with a hand-coded solution compared to using a real XML parser.

I'd like the values to be parsed as

$1 = abcd
$2 = a
$3 = a1a1a1
$4 = bb
$5 = b1b1b1b1
$6 = cc
$7 = c1c1
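One way to get that flat list is a pair of regex matches: one for the element name and one global match for the attributes. The sketch below is only an illustration, not a measured answer to the speed question; `parse_pidgin_xml` is a hypothetical helper name, and it assumes attribute values are always double-quoted and never contain a double quote.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the element name
# followed by alternating attribute names and values.
sub parse_pidgin_xml {
    my ($line) = @_;

    # Element name: first run of word characters after the opening '<'.
    # Returns an empty list if the line does not start with an element.
    my ($element) = $line =~ /^\s*<(\w+)/
        or return;

    # In list context, a //g match returns every captured (name, value)
    # pair in order. Assumes double-quoted values without embedded quotes.
    my @pairs = $line =~ /\s(\w+)="([^"]*)"/g;

    return ($element, @pairs);
}

my @fields = parse_pidgin_xml('<abcd a="a1a1a1" bb="b1b1b1b1" cc="c1c1"/>');
print join(',', @fields), "\n";   # prints abcd,a,a1a1a1,bb,b1b1b1b1,cc,c1c1
```

Whether this beats the split-based approach would need to be checked with Benchmark on real data.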

I have used the regex below, but it's still almost a factor of 2 slower than using split on the CSV version:

use strict;
use Benchmark;

my $testline = '<someelement a="123" bbb="rrr sss ttt" cccc="14 or 15">';

# Single-pass alternative (not used in the active benchmark below)
my $xmlregex  = qr/(?:\s+(\w+)=\"(.*?)\")|(?:^\s*<(\w+))|(?:>\s*$)/;
# Step 1: capture the element name and the raw attribute string
my $xmlregex1 = qr/^\s*<([^\s]+)(.*)>\s*$/;
# Step 2: split the attribute string into names and values
my $xmlregex2 = qr/\s+(\w+)="([^"]+)"/;

my @lines   = ($testline) x 5000;
my @example = ();

sub useregex2 {
    my @items = ();
    foreach my $line (@lines) {
        if ($line =~ /$xmlregex1/o) {
            my ($element, $attribs) = ($1, $2);
            # split returns the captured groups; grep drops the empty separators
            @items = grep length, split(/$xmlregex2/, $attribs);
        }
        else {
            print "useregex2: malformed XML in $line\n";
            exit;
        }
    }
}

sub usesplit {
    foreach my $line (@lines) {
        my @items = split(/\s+/, $line);
    }
}

timethese(500, {
    'useregex2' => \&useregex2,
    # 'useregex' => \&useregex,
    'usesplit'  => \&usesplit,
});

In reply to quickest way to parse pidgeon XML? by amazotron
