not sure how the optimizer optimizes so I am going to pretend it is not there
In general, every regexp should match exactly the same with the optimiser as without it, so this is usually the right thing to do.
If you want to delve more deeply ...

    perl -Dr myprog

or by using the re module:

    use re qw{ debug };

which uses some very evil cleverness to act as if you had compiled perl with DEBUGGING in the first place.
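As a minimal sketch of the second approach, the script below wraps a single match in a block so that the debug pragma only applies there. The input string and the variable names are made up for illustration; only the pattern comes from the original post. The exact debug text printed to STDERR varies between perl versions.

```perl
use strict;
use warnings;

my $str = 'un-subscribe@example.com';    # hypothetical input, not from the original post
my ($local, $domain);

{
    # Debug output for the regexp goes to STDERR while this pragma is in scope.
    use re 'debug';
    ($local, $domain) = $str =~ /(un-.*)\@(.*)/;
}

# Ordinary output goes to STDOUT, so the two streams can be separated.
print "local part: $local\n";
print "domain:     $domain\n";
```

Running it as `perl script.pl 2>debug.txt` keeps the debug dump out of the way of the normal output.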
Taking the code from the original post as an example, the output from my installed perl looks like this:
    Compiling REx `(un-.*)\@(.*)'
    size 17 Got 140 bytes for offset annotations.
    first at 3
    rarest char @ at 0
    rarest char - at 2
    1: OPEN1(3)
    3: EXACT <un->(5)
    5: STAR(7)
    6: REG_ANY(0)
    7: CLOSE1(9)
    9: EXACT <@>(11)
    11: OPEN2(13)
    13: STAR(15)
    14: REG_ANY(0)
    15: CLOSE2(17)
    17: END(0)
    anchored `un-' at 0 floating `@' at 3..2147483647 (checking anchored) minlen 4
    Offsets: [17]
    1[1] 0[0] 2[3] 0[0] 6[1] 5[1] 7[1] 0[0] 8[135549512] 0[0] 10[1] 0[0] 12[1] 11[1] 13[1] 0[0] 14[0]
    Omitting $` $& $' support.

    EXECUTING...

    [snip]
There are no options to tailor the output, so you simply have to ignore the bits you don't want. Let me try to explain what each is saying though:
Compiling REx `(un-.*)\@(.*)'
This is the regexp we are compiling.
size 17 Got 140 bytes for offset annotations.
The internal compiled form of the regexp uses 17 slots. This is rarely relevant, except that if the count exceeds 32768 (2^15), references forwards and backwards within the compiled form need to use a 4-byte offset instead of a 2-byte offset, which slows things down a bit. Note that the original code by Henry Spencer (may his tribe increase) was limited to 2^15 nodes.
See below for the meaning of "offset annotations".
first at 3 rarest char @ at 0 rarest char - at 2
This is debug information from the optimisation phase. I'm not entirely sure what "first at 3" means: I think it means that this is the offset (ie minimum length: see below) at which the first "expensive" aspect of the pattern appears, which means it is worth looking for ways to optimise things.
If the pattern is expensive (and almost everything counts as expensive if you wouldn't have been better off using index() in the first place), the engine looks for three things it can use to optimize with: a fixed substring, a floating substring, and an initial class. I'll describe them in more detail later, but at this point we've found the first two, and so we analyse them to determine the "rarest" character (ie the character in those strings least likely to appear in normal text). That's why we got those two "rarest char" messages.
1: OPEN1(3) [...] 17: END(0)
This is a text representation of the compiled form of the regexp. These are the actual nodes that will be traversed when trying to match against a string. Most of them are fairly obvious, except the special ones introduced by the optimisation phase such as CURLYM and CURLYN. But that isn't what we're here to see.
anchored `un-' at 0 floating `@' at 3..2147483647 (checking anchored) minlen 4
This one line is what we are here to see: it is the meat of what the optimiser has found. The most important thing here (for optimisation purposes) is at the end: "minlen 4" means the matcher can bail out immediately when given a string shorter than 4 characters, because no shorter string can possibly match.
The rest of the line describes the optimizer substrings I mentioned earlier: we have one substring "un-" at a fixed location (offset 0 in this case, meaning that it must appear at offset 0 of the match rather than of the string - note that the regexp is not anchored). The floating substring is the single character "@", which must appear somewhere from offsets 3 to infinity (2147483647 is 2^31-1, but the value is special-cased to do the right thing even on very long strings). And it also tells us that it prefers to check the fixed substring first ("checking anchored").
In some cases, we also get a "start class" ("stclass"), ie a character class that the first matched character must match. For example if I replace the initial 'u' in the pattern with [Uu], I see this instead:
anchored `n-' at 1 floating `@' at 3..2147483647 (checking anchored) stclass `ANYOF[Uu]' minlen 4
and this class is checked after the substrings, as a last-ditch attempt to avoid entering the full regexp engine for this location in the string.
And note that this is the idea: these optimisations are there to try and avoid invoking the full majesty of the regexp engine for as many offsets in the string as possible. Their only purpose is to say "can't match here"; when they fail, the regexp engine is given the pattern and the first remaining offset, and is expected to try and match the full pattern at that point.
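To make the "can't match here" idea concrete, here is a rough sketch of the same three facts (minlen 4, the fixed substring "un-", the floating "@" at least 3 characters later) written as a plain Perl pre-filter. It is an illustration only and not how the engine implements it: the function name cannot_match is invented, and it judges a whole string at once, whereas the real optimiser works offset by offset inside the C code.

```perl
use strict;
use warnings;

# Returns true when the string provably cannot match /(un-.*)\@(.*)/.
# A false return only means "might match": the real engine still has to run.
sub cannot_match {
    my ($str) = @_;

    # "minlen 4": nothing shorter than 4 characters can match.
    return 1 if length($str) < 4;

    # The fixed substring `un-' must appear somewhere in the string.
    my $start = index($str, 'un-');
    return 1 if $start < 0;

    # The floating `@' must appear at offset 3 or later, counted from
    # the start of the match. Searching from the first `un-' is the
    # most permissive choice, so no matching string is ever rejected.
    return 1 if index($str, '@', $start + 3) < 0;

    return 0;
}

for my $s ('abc', 'no at sign un-', '@ before un-', 'un-sub@example.com') {
    my $verdict = cannot_match($s) ? 'rejected early' : 'handed to the engine';
    print "$s: $verdict\n";
}
```

Only the last of the four sample strings survives the checks; the other three are rejected using nothing more expensive than length() and index().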
Offsets: [17] 1[1] 0[0] 2[3] 0[0] 6[1] 5[1] 7[1] 0[0] 8[135549512] 0[0] 10[1] 0[0] 12[1] 11[1] 13[1] 0[0] 14[0]
This information about offsets relates the nodes in the compiled regexp to the offsets in the string of the pattern, and is available for use by things such as regexp debuggers (eg Komodo). It isn't intended to be useful to humans, unless they're debugging the offset information.
Omitting $` $& $' support.
The code doesn't reference these special variables, so the regexp engine doesn't need to slow things down by setting them up.
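If a program does need the text before, inside and after a match, one way to get it without mentioning those three variables is the /p modifier with ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}. These arrived in perl 5.10, so they postdate the perl that produced the output above; the sketch below assumes 5.10 or later, and the input string is hypothetical.

```perl
use strict;
use warnings;

my $str = 'see un-sub@example.com';    # hypothetical input

my ($pre, $match, $post);
if ($str =~ /(un-.*)\@(.*)/p) {
    # /p asks the engine to keep the match text for this pattern only,
    # rather than paying the cost for every regexp in the program.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
    print "before:  '$pre'\n";
    print "matched: '$match'\n";
    print "after:   '$post'\n";
}
```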
Update 2006/04/06: a paragraph was truncated, I finished it off.
Hugo
In reply to Re: Re: regex search and replace globally by hv, in thread regex search and replace globally by Anonymous Monk