BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Anyone care to explain the construction of this regex?

Note: I know what it does. I am just puzzled as to the benefits of some elements of its construction.

(?: (?i) (?: [+-]? ) (?: (?= [0123456789] | [.] ) (?: [0123456789]* ) (?: (?: [.] ) (?: [0123456789]{0,} ) )? ) (?: (?: [E] ) (?: (?: [+-]? ) (?: [0123456789]+ ) ) | ) )

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re: Explain a regex
by Util (Priest) on Jan 22, 2005 at 19:28 UTC

    $RE{num}{real} is created by the real_creator subroutine in Regexp/Common/number.pm; this highly flexible subroutine builds a custom RE from parameters ($base, $places, $radix, $sep, $group, $expon). For example, the code that creates $RE{num}{real} is:

    pattern name   => [qw (num real -base=10), '-places=0,',
                       qw (-radix=[.] -sep= -group=3 -expon=E)],
            create => \&real_creator,
            ;

    Although I see a few places where special cases of sub-expressions could be recognized and automatically replaced with their simpler forms (e.g. {0,} becomes *, and base-16 [0123456789ABCDEF] becomes [0-9A-F]), I do not disagree with the module author's choice to leave those cases in their general form; the module code is clearer, and is less likely to produce incorrect REs, than if it included code to "tighten-up" the RE.

    In short, the RE is optimized for clarity and correctness in the generating code, rather than for clarity or conciseness in the RE itself.
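To make the point above concrete, here is a minimal sketch of a parameter-driven generator. It is not the real real_creator from Regexp::Common; the sub name build_real_re and its named arguments are hypothetical, and it ignores -sep and -group entirely. It only shows how interpolating parameters into a fixed template naturally yields forms like [0123456789] and {0,}.

```perl
use strict;
use warnings;

# Hypothetical, simplified generator. NOT the real real_creator from
# Regexp::Common; it ignores -sep and -group and exists only to show how
# general templates end up verbatim in the generated RE.
sub build_real_re {
    my (%arg) = @_;
    my $base   = $arg{base}   // 10;
    my $places = $arg{places} // '0,';
    my $radix  = $arg{radix}  // '[.]';
    my $expon  = $arg{expon}  // 'E';

    # Spell out the digits for any base up to 36, with no special-casing.
    my $digits = '[' . substr(join('', 0 .. 9, 'A' .. 'Z'), 0, $base) . ']';

    # The quantifier is interpolated as given, so '0,' yields {0,}, not *.
    return join '',
        '(?:(?i)(?:[+-]?)',
        "(?:(?=$digits|$radix)",
        "(?:$digits*)",
        "(?:(?:$radix)(?:$digits" . "{$places}))?)",
        "(?:(?:[$expon])(?:(?:[+-]?)(?:$digits+))|))";
}

my $generated = build_real_re();
print "$generated\n";
```

With the defaults this prints the same string as the RE in the question. Calling build_real_re(base => 16) spells out [0123456789ABCDEF] in the same way, without any code to collapse it to a range.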

      Good explanation. Thanks.


Re: Explain a regex
by merlyn (Sage) on Jan 22, 2005 at 15:57 UTC
    I don't understand why \d (or even the range 0-9) wasn't used instead of the large character class, or \. wasn't used instead of the dot character class, or why * was used in one place but {0,} was used in another.
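For comparison, a rough sketch of what a hand-tightened version might look like, checked against the generated RE on a handful of sample strings. The compact pattern and the sample list are my own assumptions, not anything from the module. It uses 0-9 rather than \d because \d can also match non-ASCII digits in Unicode strings, so it is not a strict drop-in for [0123456789].

```perl
use strict;
use warnings;

# The generated pattern, copied from the question with the whitespace removed.
my $verbose = qr/(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))/;

# A hand-tightened form (an assumption of what a shorter version could be):
# ranges instead of spelled-out digits, \. instead of [.], * instead of {0,},
# and an optional group instead of an empty alternative.
my $compact = qr/(?i:[+-]?(?=[0-9]|\.)[0-9]*(?:\.[0-9]*)?(?:E[+-]?[0-9]+)?)/;

my @samples = ('42', '-3.14', '+.5', '6.02E23', '1e-9', '7.', 'abc', '', '1e', '--1');

my $mismatches = 0;
for my $s (@samples) {
    # Anchor both patterns so the whole string has to be a number.
    my $v = $s =~ /\A$verbose\z/ ? 1 : 0;
    my $c = $s =~ /\A$compact\z/ ? 1 : 0;
    $mismatches++ if $v != $c;
    printf "%-10s verbose=%d compact=%d\n", "'$s'", $v, $c;
}
print "mismatches: $mismatches\n";
```

Agreement on these samples is not a proof of equivalence, only a sanity check that the verbose constructs are not doing anything the shorter ones cannot.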

    I'd say this was the work of someone who wasn't completely clued. I'd hate to see the rest of their code.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      !?

      use Regexp::Common qw[number];; print $RE{num}{real};;
      (?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))

      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

        I think that Randal's point is that just because something is on CPAN, or even in the main Perl distribution (which I don't think is the case for Regexp::Common), doesn't mean that it's the best, most clue-filled way to do it. ;-) For example, my modules on CPAN probably would not meet with Randal's full approval either ;-)

        The first line of your sig can be safely reversed here:

        Examine who speaks, not what is said.

Re: Explain a regex
by hv (Prior) on Jan 22, 2005 at 15:56 UTC

    I think most of the verbiage is there to keep the elements looking as similar to each other as possible, which is of dubious benefit.

    I'd guess that it's also trying to make it easy to modify it to extract any part of the number being matched by replacing the relevant (?: with (, which saves the programmer from having to find the place to insert the matching ).
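A small sketch of that idea, assuming the RE is held as a plain string: the group wrapping the exponent's sign and digits is turned into a capture by swapping its leading (?: for (, leaving the closing ) where it is. The variable names and the way the group is located (a literal index lookup) are illustrative choices of mine, not something the module provides.

```perl
use strict;
use warnings;

# The generated RE from the question, as a plain string (whitespace removed).
my $real = '(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))';

# The non-capturing group that wraps the exponent's sign and digits.
my $group = '(?:(?:[+-]?)(?:[0123456789]+))';
my $pos   = index($real, $group);
die "exponent group not found" if $pos < 0;

# Replace just the three characters "(?:" with "(" to make it capturing.
my $capturing = $real;
substr($capturing, $pos, 3, '(');

my ($exponent) = '6.02E+23' =~ /\A$capturing\z/;
print "exponent: $exponent\n";    # prints "exponent: +23"
```

Because every sub-expression already sits in its own group, the edit is a fixed three-character replacement; nothing else in the string has to move.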

    I can see no reason for the use of {0,} instead of * in one place.

    Hugo