in reply to Re: Interesting Perl/Java regexp benchmarking
in thread Interesting Perl/Java regexp benchmarking

Thanks for taking the time to reformat this. It highlights what I already thought when I glanced at the regexp in the article. I think the writer is confusing capturing and grouping.

I suspect that all the sub-captures for picking up the tricky Unicode stuff are not necessary. In fact, I don't think the alternations for picking up element begin and end tags need to be captured either.

They should all be grouped (read: (?:...)), with a single capturing paren around the whole mess. You could then walk down the stream with a while:

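    # m{...} delimiters so the slashes in the tags need no escaping.
    # A match in scalar context returns true/false, so read the token from $1.
    while ( $stream =~ m{
        (                             # the only capturing group: the whole token
            <[^/](?:[^>]*[^/>])?>
          | </[^>]*>
          | <[^>]*/>
          | (?: ... )                 # unicode goop
        )
    }gx ) {
        my $token = $1;
        print $token;
    }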

Making all those subcaptures available in $1, $2, $3... takes a significant amount of time, which could account for Perl's poor showing. (Disclaimer: I have no idea whether Java makes the same distinction between capturing and grouping. If it does, it's expending as much effort as Perl and my reasoning would be incorrect.)
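
A rough way to test this hunch is to time the two styles against each other with the core Benchmark module. This is only a sketch: the sample $stream is made up, the [^<]+ alternative is my own stand-in for plain text (it is not the article's Unicode handling), and count_tokens is a helper name invented for the example. The numbers will vary with the perl version and the input, so treat the output as a hint rather than proof.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample input; the article's real data is not reproduced here.
my $stream = '<doc><item id="1">text</item><br/></doc>' x 1000;

# Every alternative (and the inner optional part) captures,
# so perl has to track several sub-captures on each match.
my $capturing = qr{
    (
        (<[^/]([^>]*[^/>])?>)
      | (</[^>]*>)
      | (<[^>]*/>)
      | ([^<]+)          # stand-in for text, an assumption of this sketch
    )
}x;

# Same alternatives, but only grouped; one capture around the whole token.
my $grouping = qr{
    (
        <[^/](?:[^>]*[^/>])?>
      | </[^>]*>
      | <[^>]*/>
      | [^<]+            # stand-in for text, an assumption of this sketch
    )
}x;

# Walk the stream token by token and count the matches.
sub count_tokens {
    my ($re) = @_;
    my $n = 0;
    $n++ while $stream =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    capturing => sub { count_tokens($capturing) },
    grouping  => sub { count_tokens($grouping) },
} );
```

Both patterns should find exactly the same tokens; only the bookkeeping differs, which is the cost in question here.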

- another intruder with the mooring of the heat of the Perl