Re^2: Interesting Perl/Java regexp benchmarking (capturing vs. grouping)

Thanks for taking the time to reformat this. It highlights what I already thought when I glanced at the regexp in the article. I think the writer is confusing capturing and grouping.

I suspect that all the sub-captures for picking up the tricky Unicode stuff are not necessary. In fact, I don't think the alternations for picking up element begin and end tags need to be captured either.

They should all be grouped (read: (?:...)), with a single pair of capturing parens around the whole mess. You could then walk down the stream with a while:

  # m{...} delimiters, because the pattern itself contains slashes
  while( $stream =~ m{(
       <[^/](?:[^>]*[^/>])?>
     | </[^>]*>
     | <[^>]*/>
     | (?: ... ) # unicode goop
  )}gx ) {
     my $token = $1;   # the single outer capture
     print $token;
  }
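
For something you can paste and run, here is a minimal sketch of the same loop. The [^<]+ branch is only my stand-in for the elided Unicode alternation, and the sample $stream is made up for illustration; neither comes from the article. It uses m{...} as the delimiter so the slashes inside the pattern don't end it, and reads the token from $1.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the article.
my $stream = '<a href="x">hi</a><br/>';

my @tokens;
while ( $stream =~ m{(
       <[^/](?:[^>]*[^/>])?>   # start tag, inner group is non-capturing
     | </[^>]*>                # end tag
     | <[^>]*/>                # empty-element tag
     | [^<]+                   # stand-in for the unicode goop (assumption)
  )}gx ) {
    push @tokens, $1;          # only the outer paren captures
}

print join( "\n", @tokens ), "\n";
```

In scalar context each pass of a /g match picks up where the previous one stopped, which is what lets the while walk the stream one token at a time.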

Making all those subcaptures available in $1, $2, $3... takes a significant amount of time, which could account for Perl's poor showing. (Disclaimer: I have no idea whether Java makes the same distinction between capturing and grouping. If it does, and it is filling in every one of those captures too, it's expending as much effort as Perl and my reasoning would be incorrect.)
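
If anyone wants to check the claim rather than take my word for it, a rough sketch with the core Benchmark module might look like this. The two patterns are simplified stand-ins (again with [^<]+ in place of the Unicode part), the input is invented, and count_tokens is just a helper name I made up, so treat any numbers it prints as indicative only, not as a rerun of the article's test.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical input: a small piece of tag soup repeated many times.
my $stream = '<a href="x">hi</a><br/>' x 1000;

# Every alternative captured individually, plus the outer capture.
my $capturing = qr{(
       (<[^/]([^>]*[^/>])?>)
     | (</[^>]*>)
     | (<[^>]*/>)
     | ([^<]+)
  )}x;

# Same alternatives with no inner captures; one capture around the lot.
my $grouping = qr{(
       <[^/](?:[^>]*[^/>])?>
     | </[^>]*>
     | <[^>]*/>
     | [^<]+
  )}x;

# Hypothetical helper: count how many tokens a pattern finds in the text.
sub count_tokens {
    my ( $re, $text ) = @_;
    my $n = 0;
    $n++ while $text =~ /$re/g;
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    capturing => sub { count_tokens( $capturing, $stream ) },
    grouping  => sub { count_tokens( $grouping,  $stream ) },
} );
```

Both patterns find the same tokens; the only difference is how many capture buffers the engine has to maintain along the way.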

- another intruder with the mooring of the heat of the Perl
