
I think $attValue could safely use non-backtracking (atomic) groups. I also think that ($attribute*?) should really be ($attribute*): with multiple attributes, the lazy form makes the regex engine first try to match zero attributes, then one, then two, and so on, instead of matching them all at once.
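To illustrate both points, here is a minimal sketch. The `$attr` pattern and the sample tag are made up for the demonstration and are not the `$attribute` or `$attValue` from the original post. It shows that the lazy and greedy quantifiers end up capturing the same text when a `>` has to follow, and that an atomic group refuses to give back characters once it has matched them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an attribute pattern (an assumption, not the
# original poster's $attribute).
my $attr = qr/\s+\w+="[^"]*"/;
my $tag  = '<a x="1" y="2" z="3">';

# Both forms capture the same text, because the '>' that follows forces
# the lazy version to keep extending until every attribute is consumed.
my ($greedy) = $tag =~ /<\w+((?:$attr)*)>/;
my ($lazy)   = $tag =~ /<\w+((?:$attr)*?)>/;

# An atomic group never gives back what it has matched.
my $plain  = 'aaab' =~ /^a+ab/     ? 1 : 0;   # a+ backtracks by one 'a', so this matches
my $atomic = 'aaab' =~ /^(?>a+)ab/ ? 1 : 0;   # (?>a+) keeps all three 'a's, so this fails

print "same capture: ", ($greedy eq $lazy ? "yes" : "no"), "\n";
print "plain: $plain, atomic: $atomic\n";
```

Since the result is the same either way, the greedy form simply gets there with fewer attempts. The atomic group is only safe where giving characters back could never lead to a match anyway.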

(.*) # what's between tags, even newline

Are you sure you want that greedy, and with no further restrictions? Maybe it's appropriate in your case, but in the general case it's not good when parsing XML ;-)
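A small sketch of why the unrestricted greedy match is risky. The sample string and the `<b>` tag are invented for the example and are not taken from the original regex:

```perl
use strict;
use warnings;

# Made-up input with two sibling elements.
my $text = '<b>one</b> and <b>two</b>';

# Greedy: (.*) runs to the end of the string and backtracks only as far
# as the LAST closing tag, so it swallows the markup in between.
my ($greedy) = $text =~ m{<b>(.*)</b>}s;

# Lazy: (.*?) stops at the FIRST closing tag.
my ($lazy) = $text =~ m{<b>(.*?)</b>}s;

print "greedy: $greedy\n";   # one</b> and <b>two
print "lazy:   $lazy\n";     # one
```

The lazy form is not a complete fix either, since it still gets nested elements of the same name wrong, which is why a recursive pattern is the more robust approach.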

Shortly after the release of perl 5.10.0 I wrote an "XML" regex, mostly as an exercise for the cool new regex features. It parses only a small subset of XML, but maybe it's of use to you:


#!/usr/bin/env perl5.10.0
use strict;
use warnings;
use re 'eval';
use Test::More qw(no_plan);
use 5.010;
use Data::Dumper;


my $xml;
my $nested_tags;

my $cdata = qr{ (?> 
       [^<>&"]+     # any amount of "normal" text
     | \&\w+;       # named chars
     | \&\#\d+;     # numbered codepoints
)}x;

$xml = qr{
    (?> (??{$nested_tags}) | $cdata)+
}x;

#$xml = qr/ $single_xml+/;


my $name = qr{
    (?>\w+(?: [:-]\w+)*)
}x;
my $attribute = qr{
    (?>$name="$cdata*+")
}x;

{
    $nested_tags = qr{
        (?<nested_tags>
        <   
            ($name) 
#            (?{print "after <$^N: \n"})
#            (?{print "match: [$&] (\$2:$2) \n"})
            (?>\s+$attribute)*\s*
            (?:
                /\s*>                # either an empty tag end ...
    
            | >                      # or end-of-tag and
            (?> (?&nested_tags) | $cdata)*+ # arbitrary XML
                </\s* (??{$2})\s*> # and a closing tag containing
                                    # the current name
            )
        )
    }x;
}


like   "foo bar baz",       qr/^$cdata$/, "cdata";
unlike "<bla>",             qr/^$cdata$/, "cdata";
like   'blerk="foo"',       qr/^$attribute$/, "simple attribute";
unlike 'blerk=bar',         qr/^$attribute$/, "non-quoted attribute";
like   '<bla />',           qr/^$nested_tags$/, "single, empty XML tag";
unlike '<bla>',             qr/^$nested_tags$/, "single, non-empty XML tag";
like '<bla></bla>',         qr/^$nested_tags$/, "single, closed XML tag";
like '<bla><blubb/></bla>', qr/^$nested_tags$/, "nested tags 1";
like '<bla>foo</bla>',      qr/^$nested_tags$/, "nested tags 2";
unlike '<bla><blubb></bla>',qr/^$nested_tags$/, "nested tags 3";
like '<bla><blubb></blubb></bla>',
                            qr/^$nested_tags$/, "nested tags 4";
like '<moep><blubb></blubb></moep><foo/><bar></bar>',
                            qr/^$xml+$/, 'multiple nested tags';
unlike '<bla><blubb></foo></bla>',
                            qr/^$xml+$/, "wrongly nested tags";
like '<bla>foo</bla>',      qr/^$xml+$/, "nested tags with cdata";
like '<bla foo="bar" />',   qr/^$nested_tags$/, 'Tag with attribute';
like '<bla>foo &auml;blubb</bla>',
                            qr/^$nested_tags$/, 'Tags with named entities';
unlike '<bla>foo &aumlblubb</bla>',
                            qr/^$nested_tags$/, 'Tags with malformed named entities';

#print Dumper \@names;
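The script relies on the named recursion feature `(?&name)` introduced in perl 5.10. In case the XML version is hard to follow, here is the same technique reduced to balanced parentheses. This is a simplified sketch and not part of the script above; the names `$balanced` and `paren` are chosen just for this illustration.

```perl
use strict;
use warnings;
use 5.010;

# A group that matches '(' ... ')' and may contain itself any number of times.
my $balanced = qr{
    (?<paren>
        \(
        (?: [^()]++          # a run of non-parenthesis characters, possessive
          | (?&paren)        # or a nested, balanced group (recursion)
        )*
        \)
    )
}x;

# 1 = the whole string is one balanced group, 0 = it is not.
my @results = map { /^$balanced$/ ? 1 : 0 } '(a(b)c)', '(a(b)c', '()()';

say join ',', @results;   # 1,0,0
```

The last string fails only because the pattern is anchored and expects exactly one group; `$nested_tags` behaves the same way, which is why the tests for several sibling tags use `$xml+` instead.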

In reply to Re: Regex optimization: Can (?> ) and minimal match help here? by moritz
in thread Regex optimization: Can (?> ) and minimal match help here? by Anonymous Monk
