Thanks for taking the time to reformat this. It highlights what I already thought when I glanced at the regexp in the article. I think the writer is confusing capturing and grouping.
I suspect that all the sub-captures for picking up the tricky Unicode stuff are not necessary. In fact, I don't think the alternations for picking up element begin and end tags need to be captured either.
They should all be grouped (read: (?:...)), with a single pair of capturing parens around the whole mess. You could then walk down the stream with a while:

    while ( $stream =~ m{
        ( <[^/](?:[^>]*[^/>])?>
        | </[^>]*>
        | <[^>]*/>
        | (?: ... )   # unicode goop
        )
    }gx ) {
        my $token = $1;   # a scalar-context match only returns true/false
        print $token;
    }
Making all those subcaptures available in $1, $2, $3... takes a significant amount of time, which could account for Perl's poor showing. (Disclaimer: I have no idea whether Java makes the same distinction between capturing and grouping. If it does, it's expending as much effort as Perl and my reasoning would be incorrect.)
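To make the capturing vs. grouping distinction concrete, here is a minimal sketch. The sample $stream and the [^<]+ alternative are my own stand-ins (the latter replaces the elided Unicode text pattern), not anything taken from the article's regexp.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only for illustration.
my $stream = '<a href="x">text</a><br/>';

# Inner parens capture: a match fills both $1 and $2.
my $capturing = qr{(<[^/]([^>]*[^/>])?>)};
# Inner parens only group: a match fills $1 alone.
my $grouping  = qr{(<[^/](?:[^>]*[^/>])?>)};

# In list context a match returns one element per capture group.
my @with_capture = $stream =~ $capturing;
my @with_group   = $stream =~ $grouping;
print scalar(@with_capture), " vs ", scalar(@with_group), "\n";   # prints "2 vs 1"

# Walk the stream with a single capture around the whole alternation.
# [^<]+ is an assumed placeholder for the text between tags.
my @tokens;
while ( $stream =~ m{
    ( <[^/](?:[^>]*[^/>])?>   # start tag
    | </[^>]*>                # end tag
    | <[^>]*/>                # empty-element tag
    | [^<]+                   # text (assumption)
    )
}gx ) {
    push @tokens, $1;
}
print "$_\n" for @tokens;
```

This only shows that the grouping-only version has fewer captures to populate per match. Whether that explains the benchmark gap would need timing with the real pattern, for instance with the core Benchmark module.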
- another intruder with the mooring of the heat of the Perl
In reply to Re^2: Interesting Perl/Java regexp benchmarking (capturing vs. grouping)
by grinder
in thread Interesting Perl/Java regexp benchmarking
by dreadpiratepeter