not sure how the optimizer optimizes so I am going to pretend it is not there

In general, every regexp should match exactly the same with the optimiser as without it, so this is usually the right thing to do.

If you want to delve more deeply ...

the first port of call should be the debug output, which you can get by supplying the -Dr switch to perl (available only if perl has been compiled with DEBUGGING enabled):

  perl -Dr myprog

or by using the re module:

  use re qw{ debug };

which uses some very evil cleverness to act as if you had compiled perl with DEBUGGING in the first place.
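If you only care about one regexp, it can help to confine the pragma to a block. The following is a minimal sketch, assuming perl 5.10 or later (where `use re 'debug'` is lexically scoped); the input string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $addr = 'un-subscribe@example.com';   # hypothetical input
my ($local, $domain);

{
    # Debug output goes to STDERR and, on perl 5.10+, is limited to
    # regexps compiled inside this block.
    use re 'debug';
    if ($addr =~ /(un-.*)\@(.*)/) {
        ($local, $domain) = ($1, $2);
    }
}

# Outside the block, this match should produce no debug output.
my $has_dot = $addr =~ /\./ ? 1 : 0;

print "local part: $local, domain: $domain, has dot: $has_dot\n";
```

Redirecting STDERR to a file (for example `perl myprog 2> regex.log`) keeps the dump separate from the program's normal output.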

Taking the code from the original post as an example, the output from my installed perl looks like this:

Compiling REx `(un-.*)\@(.*)'
size 17 Got 140 bytes for offset annotations.
first at 3
rarest char @ at 0
rarest char - at 2
   1: OPEN1(3)
   3:   EXACT <un->(5)
   5:   STAR(7)
   6:     REG_ANY(0)
   7: CLOSE1(9)
   9: EXACT <@>(11)
  11: OPEN2(13)
  13:   STAR(15)
  14:     REG_ANY(0)
  15: CLOSE2(17)
  17: END(0)
anchored `un-' at 0 floating `@' at 3..2147483647 (checking anchored) minlen 4
Offsets: [17]
        1[1] 0[0] 2[3] 0[0] 6[1] 5[1] 7[1] 0[0] 8[135549512] 0[0] 10[1] 0[0] 12[1] 11[1] 13[1] 0[0] 14[0]
Omitting $` $& $' support.

EXECUTING...
[snip]

The perl used here has no options to tailor the output, so you simply have to ignore the bits you don't want (perl 5.10 and later can narrow it down with, for example, use re qw(Debug COMPILE);). Let me try to explain what each is saying though:

Compiling REx `(un-.*)\@(.*)'

This is the regexp we are compiling.

size 17 Got 140 bytes for offset annotations.

The internal compiled form of the regexp uses 17 slots. This is rarely relevant, except that if the count exceeds 32768 (2^15), references forwards and backwards within the compiled form need to use a 4-byte offset instead of a 2-byte offset, which slows things down a bit. Note that the original code by Henry Spencer (may his tribe increase) was limited to 2^15 nodes.

See below for the meaning of "offset annotations".

first at 3
rarest char @ at 0
rarest char - at 2

This is debug information from the optimisation phase. I'm not entirely sure what "first at 3" means: I think it means that this is the offset (ie minimum length: see below) at which the first "expensive" aspect of the pattern appears, which means it is worth looking for ways to optimise things.

If the pattern is expensive (and almost everything counts as expensive if you wouldn't have been better off using index() in the first place), the engine looks for three things it can use to optimize with: a fixed substring, a floating substring, and an initial class. I'll describe them in more detail later, but at this point we've found the first two, and so we analyse them to determine the "rarest" character (ie the character in those strings least likely to appear in normal text). That's why we got those two "rarest char" messages.

   1: OPEN1(3)
[...]
  17: END(0)

This is a text representation of the compiled form of the regexp. These are the actual nodes that will be traversed when trying to match against a string. Most of them are fairly obvious, except the special ones introduced by the optimisation phase such as CURLYM and CURLYN. But that isn't what we're here to see.

anchored `un-' at 0 floating `@' at 3..2147483647 (checking anchored) minlen 4

This one line is what we are here to see: it is the meat of what the optimiser has found. The most important thing here (for optimisation purposes) is at the end: "minlen 4" means the matcher can bail out immediately when given a string shorter than 4 characters, because no shorter string can possibly match.
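To see why 4 is the minimum, here is a small sketch that runs the same pattern against a few hand-picked strings (the strings are my own, chosen for illustration). The shortest possible match is "un-@": the three fixed characters, the "@", and nothing for either `.*`.

```perl
use strict;
use warnings;

my $re = qr/(un-.*)\@(.*)/;

# Candidate strings: the first four are shorter than minlen,
# the last two are long enough and contain both substrings.
my @candidates = ('', '@', 'un-', 'n-@', 'un-@', 'xun-@y');

my %result;
for my $s (@candidates) {
    $result{$s} = $s =~ $re ? 'match' : 'no match';
    printf "%-8s length %d: %s\n", "'$s'", length($s), $result{$s};
}
```

Nothing shorter than 4 characters can match, which is exactly what lets the matcher reject such strings without looking at their contents.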

The rest of the line describes the optimizer substrings I mentioned earlier: we have one substring "un-" at a fixed location (offset 0 in this case, meaning that it must appear at offset 0 of the match rather than of the string - note that the regexp is not anchored). The floating substring is the single character "@", which must appear somewhere from offsets 3 to infinity (2147483647 is 2^31-1, but the value is special-cased to do the right thing even on very long strings). And it also tells us that it prefers to check the fixed substring first ("checking anchored").

In some cases, we also get a "start class" ("stclass"), ie a character class that the first matched character must match. For example if I replace the initial 'u' in the pattern with [Uu], I see this instead:

  anchored `n-' at 1 floating `@' at 3..2147483647 (checking anchored) stclass `ANYOF[Uu]' minlen 4

and this class is checked after the substrings, as a last-ditch attempt to avoid entering the full regexp engine for this location in the string.

And note that this is the idea: these optimisations are there to try and avoid invoking the full majesty of the regexp engine for as many offsets in the string as possible. Their only purpose is to say "can't match here"; when they fail, the regexp engine is given the pattern and the first remaining offset, and is expected to try and match the full pattern at that point.
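The "can't match here" idea can be imitated by hand with length() and index(). This is only a rough sketch of the principle, not the engine's real code: `might_match` is a hypothetical helper that applies the minlen, anchored and floating checks for the pattern above, and the sample strings are invented.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the engine's real logic: returns false only
# when /(un-.*)\@(.*)/ cannot possibly match $str.
sub might_match {
    my ($str) = @_;
    return 0 if length($str) < 4;        # minlen 4
    my $start = index($str, 'un-');      # anchored substring, offset 0 of the match
    return 0 if $start < 0;
    # floating substring: '@' at offset 3 or later, relative to the match start
    return index($str, '@', $start + 3) >= 0 ? 1 : 0;
}

my @samples = ('hello world', 'un-done', 'x@un-', 'un-sub@example.com', "un-\n\@x");

for my $s (@samples) {
    my $pre  = might_match($s) ? 'try engine' : 'reject';
    my $full = $s =~ /(un-.*)\@(.*)/ ? 'match' : 'no match';
    (my $shown = $s) =~ s/\n/\\n/g;      # make the newline visible when printing
    print "$shown: $pre, $full\n";
}
```

The last sample passes the cheap checks but still fails in the full engine, because `.` does not match a newline. The checks can only rule positions out; they never prove a match.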

Offsets: [17]
        1[1] 0[0] 2[3] 0[0] 6[1] 5[1] 7[1] 0[0] 8[135549512] 0[0] 10[1] 0[0] 12[1] 11[1] 13[1] 0[0] 14[0]

This information about offsets relates the nodes in the compiled regexp to the offsets in the string of the pattern, and is available for use by things such as regexp debuggers (eg Komodo). It isn't intended to be useful to humans, unless they're debugging the offset information.

Omitting $` $& $' support.

The code doesn't reference these special variables, so the regexp engine doesn't need to slow things down by setting them up.
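If you do need the text before, inside or after the match, the offset arrays @- and @+ give you the same information without mentioning those variables anywhere in the program. A minimal sketch, with an invented input string:

```perl
use strict;
use warnings;

my $str = 'please un-subscribe me@example.com now';   # hypothetical input
my ($pre, $match, $post);

if ($str =~ /(un-.*)\@(.*)/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match
    $pre   = substr($str, 0, $-[0]);                # what $` would hold
    $match = substr($str, $-[0], $+[0] - $-[0]);    # what $& would hold
    $post  = substr($str, $+[0]);                   # what $' would hold
    print "pre=[$pre] match=[$match] post=[$post]\n";
}
```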

That's about it. The two functions that deal with all this - one to find the optimisation information, and one to use it - are without doubt the two pieces of code most likely to cause insanity and/or sleepless nights in 90% of all respondents that didn't run away when we asked.

Update 2006/04/06: a paragraph was truncated, I finished it off.

Hugo

In reply to Re: Re: regex search and replace globally by hv
in thread regex search and replace globally by Anonymous Monk
