I think $attValue could safely use non-backtracking (atomic) groups, i.e. (?> ... ). I also think that ($attribute*?) really should be ($attribute*): if you have multiple attributes, the non-greedy version makes the RE engine first try to match zero of them, then one, then two and so on, instead of matching them all in the first place.
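A minimal sketch of that point, assuming a simplified stand-in for $attribute (the pattern and the capture_attributes helper here are made up for illustration, not taken from your code). Both variants end up with the same capture; the lazy one just gets there by trying zero, one, two ... attributes and failing on the closing > each time.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the $attribute pattern under discussion.
my $attribute = qr{ \s+ \w+ = " [^"]* " }x;

# Returns the attribute text captured by the lazy or the greedy variant,
# or undef if the tag does not match at all.
sub capture_attributes {
    my ($tag, $lazy) = @_;
    my $re = $lazy
        ? qr{ ^ < \w+ ( $attribute*? ) \s* > }x
        : qr{ ^ < \w+ ( $attribute*  ) \s* > }x;
    return $tag =~ $re ? $1 : undef;
}

my $tag = '<img src="a.png" alt="A" width="10">';
print "lazy:   [", capture_attributes($tag, 1), "]\n";
print "greedy: [", capture_attributes($tag, 0), "]\n";
```

Running it under "use re 'debug'" should show the extra attempts the lazy version makes before it reaches the same result.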
(.*) # what's between tags, even newline

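Are you sure you want that greedy, and with no further restrictions? Maybe it's appropriate in your case, but in the general case it's not a good idea when parsing XML: a greedy .* swallows as much as it can, so it typically runs on to the last closing tag instead of the nearest one ;-)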
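To show what I mean, here is a small sketch with a made-up input string and a made-up first_capture helper (neither comes from your code). It compares the unrestricted greedy capture with a lazy one and with a negated character class.

```perl
use strict;
use warnings;

# Returns the first capture of $re against $text, or undef on no match.
sub first_capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

# Hypothetical input with two sibling elements.
my $text = "<b>one</b> and <b>two</b>";

my $greedy     = qr{<b>(.*)</b>}s;      # unrestricted and greedy
my $lazy       = qr{<b>(.*?)</b>}s;     # stops at the first </b>
my $restricted = qr{<b>([^<]*)</b>};    # cannot cross a tag at all

print "greedy:     ", first_capture($greedy,     $text), "\n";
print "lazy:       ", first_capture($lazy,       $text), "\n";
print "restricted: ", first_capture($restricted, $text), "\n";
```

The greedy version captures everything from the first opening tag up to the last closing tag; the other two stop after "one". The lazy version can still cross a nested tag, which is why I'd prefer a restriction on what may appear between the tags.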

Shortly after the release of perl 5.10.0 I wrote an "XML" regex, mostly as an exercise for the cool new regex features. It parses only a small subset of XML, but maybe it's of use to you:

#!/usr/bin/env perl5.10.0
use strict;
use warnings;
use re 'eval';
use Test::More qw(no_plan);
use 5.010;
use Data::Dumper;

my $xml;
my $nested_tags;
my $cdata = qr{
    (?>
        [^<>&"]+    # any amount of "normal" text
    |   \&\w+;      # named chars
    |   \&\#\d+;    # numbered codepoints
    )}x;

$xml = qr{
    (?> (??{$nested_tags}) | $cdata)+
}x;

#$xml = qr/ $single_xml+/;

my $name = qr{ (?>\w+(?: [:-]\w+)*) }x;
my $attribute = qr{ (?>$name="$cdata*+") }x;
{
    $nested_tags = qr{
        (?<nested_tags>
        <
        ($name)
#        (?{print "after <$^N: \n"})
#        (?{print "match: [$&] (\$2:$2) \n"})
        (?>\s+$attribute)*\s*
        (?:
            /\s*>                            # either an empty tag end ...
            |
            >                                # or end-of-tag and
            (?> (?&nested_tags) | $cdata)*+  # arbitrary XML
            </\s* (??{$2})\s*>               # and a closing tag containing
                                             # the current name
        )
        )
    }x;
}

like   "foo bar baz",  qr/^$cdata$/, "cdata";
unlike "<bla>",        qr/^$cdata$/, "cdata";
like   'blerk="foo"',  qr/^$attribute$/, "simple attribute";
unlike 'blerk=bar',    qr/^$attribute$/, "non-quoted attribute";
like   '<bla />',      qr/^$nested_tags$/, "single, empty XML tag";
unlike '<bla>',        qr/^$nested_tags$/, "single, non-empty XML tag";
like   '<bla></bla>',  qr/^$nested_tags$/, "single, closed XML tag";
like   '<bla><blubb/></bla>', qr/^$nested_tags$/, "nested tags 1";
like   '<bla>foo</bla>',      qr/^$nested_tags$/, "nested tags 2";
unlike '<bla><blubb></bla>',  qr/^$nested_tags$/, "nested tags 3";
like   '<bla><blubb></blubb></bla>', qr/^$nested_tags$/, "nested tags 4";
like   '<moep><blubb></blubb></moep><foo/><bar></bar>', qr/^$xml+$/,
       'multiple nested tags';
unlike '<bla><blubb></foo></bla>', qr/^$xml+$/, "wrongly nested tags";
like   '<bla>foo</bla>', qr/^$xml+$/, "nested tags with cdata";
like   '<bla foo="bar" />', qr/^$nested_tags$/, 'Tag with attribute';
like   '<bla>foo &auml;blubb</bla>', qr/^$nested_tags$/,
       'Tags with named entities';
unlike '<bla>foo &aumlblubb</bla>', qr/^$nested_tags$/,
       'Tags with malformed named entities';

#print Dumper \@names;
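If the recursion in $nested_tags is hard to follow, the same 5.10 features (a named group that calls itself via (?&name), plus atomic groups and possessive quantifiers) can be seen in a much smaller sketch. This one matches balanced parentheses instead of XML; $balanced and is_balanced are names invented for this illustration and are not part of the script above.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical example pattern, not part of the script above.
my $balanced = qr{
    (?<balanced>              # named group, so it can call itself
        \(
        (?> [^()]+            # a run of non-parenthesis characters
          | (?&balanced)      # or a nested, balanced group
        )*+                   # possessive: never give pieces back
        \)
    )
}x;

# Returns 1 if the whole string is one balanced group, 0 otherwise.
sub is_balanced {
    my ($text) = @_;
    return $text =~ /^$balanced$/ ? 1 : 0;
}

say "$_ => ", is_balanced($_) for '(a(b)c)', '(a(b c)', '()';
```

The XML regex has the same shape, except that the closing delimiter is not fixed: it has to repeat the captured tag name, which is what the (??{$2}) part is for.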

In reply to Re: Regex optimization: Can (?> ) and minimal match help here? by moritz
in thread Regex optimization: Can (?> ) and minimal match help here? by Anonymous Monk
