IMPLEMENTATION

NAME

Issues with the Current Regular Expression Engine

SYNOPSIS

How the engine works, possible bugs, and requested modifications.

DESCRIPTION

Using `$&` and `$DIGIT`

If Perl sees $`, $&, or $' anywhere in your program at compile-time, it sets a global boolean, PL_sawampersand, to TRUE. From then on, every regex makes these variables available upon success. They are made available by copying the string that was matched against to memory. The string is copied to rx->subbeg at the end of the pattern match, and the copying takes time and space. When $` is accessed, Perl goes through the following magic:

  /*
    sv_setpvn(SV *sv, char *ptr, STRLEN len)
    copies to the string slot of sv the first
    len characters of the string pointed to by
    ptr

    rx->subbeg is the string matched against,
    and is a char*, not a Perl SV

    rx->sublen is the length of that string

    rx->startp[] and rx->endp[] are like the
    @- and @+ arrays, which hold the offsets
    in rx->subbeg of $& and the $DIGIT vars
  */

  sv_setpvn(sv, rx->subbeg, rx->startp[0]);

When you access $&, it does:

  sv_setpvn(
    sv,
    rx->subbeg + rx->startp[0],
    rx->endp[0] - rx->startp[0]
  );

When you access $', it does:

  sv_setpvn(
    sv,
    rx->subbeg + rx->endp[0],   /* endp[0] is an offset into subbeg */
    rx->sublen - rx->endp[0]
  );
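
The same arithmetic can be checked from Perl space, since @- and @+ expose the offsets that rx->startp[] and rx->endp[] hold. This is only a sketch of the idea, not what the engine runs; match_parts() is a made-up helper name, and the subject string simply stands in for rx->subbeg.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) computed from
# offsets, without ever mentioning $`, $& or $'.
sub match_parts {
    my ($subject, $re) = @_;
    return unless $subject =~ $re;

    # $-[0] and $+[0] play the role of rx->startp[0] and rx->endp[0];
    # $subject plays the role of rx->subbeg.
    my ($start, $end) = ($-[0], $+[0]);

    return (
        substr($subject, 0,      $start),                    # like $`
        substr($subject, $start, $end - $start),             # like $&
        substr($subject, $end,   length($subject) - $end),   # like $'
    );
}

my ($pre, $match, $post) = match_parts("hello japhy world", qr/japhy/);
print "pre=[$pre] match=[$match] post=[$post]\n";
# prints: pre=[hello ] match=[japhy] post=[ world]
```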

I hope this all makes sense.

The $DIGIT variables work in much the same way, except that a pattern match will only copy to rx->subbeg for a regex that uses capturing parentheses. In other words, $DIGIT variables incur the same penalty as $& and friends, but only on a per-regex basis, whereas $& is a global penalty. Here is the magic behind $1:

  /* $1 */
  sv_setpvn(
    sv,
    rx->subbeg + rx->startp[1],   /* startp[1] is an offset into subbeg */
    rx->endp[1] - rx->startp[1]
  );

and so on for $2, etc. Of course, there are safeguards to make sure $3 returns undef if the last successful regex didn't have that many parens, or if $3 was not set (due to alternation):

  "japhy" =~ /(..)(..)(?:(..)|(.))/;
  # $1 = 'ja', $2 = 'ph', $3 = undef, $4 = 'y'

  "japhy" =~ /(..)(..)/;
  # $1 = 'ja', $2 = 'ph'

The code in the Perl source is:

  if (
    paren <= rx->nparens &&
    (s1 = rx->startp[paren]) != -1 &&
    (t1 = rx->endp[paren]) != -1
  ) { /* it's a valid var */ }
  else {
    sv_setsv(sv, &PL_sv_undef);  /* set it to undef */
  }
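
For anyone who would rather poke at this from Perl than read mg.c, here is a rough Perl-level model of that check. It is a sketch under assumptions: capture() is a hypothetical helper, $#+ stands in for rx->nparens, and an undefined entry in @- or @+ stands in for the -1 offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what $N would hold after matching
# $subject against $re, or undef if $N is not a valid variable.
sub capture {
    my ($subject, $re, $n) = @_;
    return undef unless $subject =~ $re;

    # $#+ is the number of groups in the last successful match
    return undef if $n > $#+;

    # a group that did not take part in the match has undefined offsets
    return undef unless defined $-[$n] && defined $+[$n];

    return substr($subject, $-[$n], $+[$n] - $-[$n]);
}

my $re = qr/(..)(..)(?:(..)|(.))/;
for my $n (1 .. 5) {
    my $v = capture("japhy", $re, $n);
    printf "\$%d = %s\n", $n, defined $v ? "'$v'" : 'undef';
}
# prints:
#   $1 = 'ja'
#   $2 = 'ph'
#   $3 = undef
#   $4 = 'y'
#   $5 = undef
```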

The important point to see here is that, if you want access to these values, you must copy the string you matched. That cannot be avoided.

The Loophole

Well, there's one case where you can get away with it: global matching in scalar context. ``When do you do that?'' you might ask. More often than you might expect. This ``trick'' only works when PL_sawampersand is false, by the way. If you've used $& or its friends, this loophole does not apply.

  # parsing a config file
  #   key = value
  while (/(\w+)\s*=\s*(.*)/g) {
    $config{$1} = $2;
  }

If your $_ there is thousands of bytes, you would have to be copying the entire string every time you matched, in order to get $1 and $2. But Perl makes an exception for you in this case. Global matching in scalar context does an evil trick -- instead of using savepvn() to copy the string being matched to rx->subbeg, it merely uses a pointer to the string itself! This is an effortless task, but potentially dangerous:

  $_ = "japhy";
  /(..)/;
  $_ = "tilly";
  print $1;  # 'ja'

  # with the /g flag
  $_ = "japhy";
  /(..)/g;
  $_ = "tilly";
  print $1;  # 'ti'

Whoa. That can be very bad. That bug used to also exist in the following situation:

  $_ = "japhy";
  () = /(..)/;
  $_ = "tilly";
  print $1;  # 'ti' until change 9018

That will be fixed in the next released version of Perl, if that patch is not already in 5.6.1.
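
Whatever perl you are running, one defensive habit sidesteps the aliasing entirely: copy the captures into your own variables before the matched string gets a chance to change. A minimal sketch (first_two() is a made-up name, and this is a workaround in user code, not a fix to the engine):

```perl
use strict;
use warnings;

# Hypothetical helper: grabs the first two characters with a /g match,
# then clobbers $_ the way the examples above do.
sub first_two {
    my ($text) = @_;
    local $_ = $text;

    my $saved;
    if (/(..)/g) {
        $saved = $1;    # copy right away, while $_ is still untouched
    }

    $_ = "tilly";       # from here on $1 may not be trustworthy
    return $saved;      # but our own copy is
}

print first_two("japhy"), "\n";   # prints: ja
```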

Fixing the Hole, Keeping the Loop

Ben Tilly had a stunning idea: telling the pattern match to only copy from rx->startp[0] to rx->endp[0]! If you don't care about the prematch or postmatch portions, you would be able to save a lot of time and space.

Now, here's the problem. The current behavior is to copy all of the string, except in the case of this loophole. If we ``fix'' the loophole to copy the string, we will experience an incredible slow-down in many cases (which is bad). We could implement a pragma to make it faster, but this means we are changing the default behavior of Perl, which is generally considered a no-no.

The approach Ben Tilly has suggested is not difficult to implement.

IMPLEMENTATION

Here is the test program.

  1 if $&;
  $_ = "_" x 50_000;
  $n = times();
  1 while /(.)/g;
  printf "%.2f seconds\n", times() - $n;

And when run, it takes about 40 seconds on my machine:

  jefpin@towers [4:31pm] ~/perl-current #236> ./miniperl tilly-old
  38.88 seconds

Here is the test program with the HINT_NOPREPOST pragma on:

  BEGIN { $^H |= 0x00040000 }
  1 if $&;
  $_ = "_" x 50_000;
  $n = times();
  1 while /(.)/g;
  printf "%.2f seconds\n", times() - $n;

  jefpin@towers [4:31pm] ~/perl-current #237> ./miniperl tilly-new
  0.28 seconds

Damn that was fast. ;)
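
A back-of-the-envelope count shows where the time goes. The sketch below assumes the old code copies the whole string on every successful match and the new code copies only $&; bytes_copied() is a hypothetical helper, and it counts bytes only, so it says nothing about allocation overhead.

```perl
use strict;
use warnings;

# Hypothetical helper: total bytes copied while walking a string of
# $len bytes with a /g match that consumes $match_len bytes each time.
# Returns (bytes copied when saving the whole string,
#          bytes copied when saving only the match).
sub bytes_copied {
    my ($len, $match_len) = @_;
    my $matches = int($len / $match_len);
    return ($matches * $len, $matches * $match_len);
}

my ($full, $only) = bytes_copied(50_000, 1);
print "whole string: $full bytes, match only: $only bytes\n";
# prints: whole string: 2500000000 bytes, match only: 50000 bytes
```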

What my modification does is change the behavior from:

  /* save the string we matched and its length */
  I32 i = PL_regeol - startpos + (stringarg - strbeg);
  s = savepvn(strbeg, i);
  prog->subbeg = s;
  prog->sublen = i;

to:

  /* if we don't want $` and $' */
  if (flags & REXEC_NOPREPOST) {
    /* i is the length, and is from $&'s start to $&'s end */
    I32 i = prog->endp[0] - prog->startp[0];
    s = savepvn(strbeg + prog->startp[0], i);
    prog->subbeg = s;
    prog->sublen = -2;  /* -2 is a sentinel value */
  }

  /* otherwise, normal behavior */
  else {
    I32 i = PL_regeol - startpos + (stringarg - strbeg);
    s = savepvn(strbeg, i);
    prog->subbeg = s;
    prog->sublen = i;
  }

The sentinel value of -2 is then used in mg.c's code. You see, rx->subbeg now holds only $&, but rx->startp[] and rx->endp[] still hold offsets relative to the ENTIRE string that was matched. So when you fetch $1 or $&, if rx->sublen is -2, Perl subtracts rx->startp[0] from the offsets before it takes the substring.
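
The offset adjustment is easy to model in Perl. This is a sketch of the idea only, not the patch itself: capture_rebased() is a hypothetical helper, and $subbeg is a plain Perl string standing in for a buffer that holds nothing but $&.

```perl
use strict;
use warnings;

# Hypothetical helper: fetches group $n from a buffer that contains
# only the matched text, using offsets recorded against the full string.
sub capture_rebased {
    my ($subject, $re, $n) = @_;
    return undef unless $subject =~ $re;

    my @start = @-;    # offsets relative to the ENTIRE string
    my @end   = @+;

    # keep only the matched portion, as the NOPREPOST branch does
    my $subbeg = substr($subject, $start[0], $end[0] - $start[0]);

    return undef if $n > $#end;
    return undef unless defined $start[$n] && defined $end[$n];

    # rebase: subtract the start of the match before taking the substring
    my $base = $start[0];
    return substr($subbeg, $start[$n] - $base, $end[$n] - $start[$n]);
}

my $re = qr/(ja)(ph)y/;
print capture_rebased("xxjaphyzz", $re, $_), "\n" for 0 .. 2;
# prints:
#   japhy
#   ja
#   ph
```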

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
