demerphq has asked for the wisdom of the Perl Monks concerning the following question:

I've been wondering about something in perls regexes. If we have a sub pattern in a regex like: (LIST|OF|LISTS) can anybody think of a circumstance where "LISTS" will ever match? I've been pondering this for a while, and I cant think of any reason why one cannot consider the former to be functionally equivelent to (LIST|OF). Any ideas?

BTW, I am aware that (LISTS|OF|LIST) will produce very different results from (LIST|OF|LISTS), the point is to determine whether its fair to say that the latter contains a sub pattern that will NEVER match so that possibly a warning could be generated, or possibly the regex engine could assume its an error and silently assume it should be treated as though it were spelled (LISTS|OF|LIST).

A last point of interest, the two constructs (LISTS|OF|LIST) and (LIST|OF|LISTS) would be equivelent in many regex implementations. Perls particular implementation however does not see them as the same at all.

---
demerphq

  • Comment on Superfluous subpatterns in regex alternations.

Replies are listed 'Best First'.
Re: Superfluous options in regex alternations.
by gaal (Parson) on Jan 03, 2005 at 12:27 UTC
    If you have a *sub* pattern, then yes, here's when LISTS can match:

    perl -le '"LISTS" =~ /(LIST|OF|LISTS)$/ and print "matched $1"'

Re: Superfluous options in regex alternations.
by Aristotle (Chancellor) on Jan 03, 2005 at 12:33 UTC

    That option can match any time the regex engine has picked the first one but is forced to backtrack into the alternation. gaal gave you one example. Here are some artificial ones where the LIST option will never match but the LISTS one may:

    / (LIST|OF|LISTS) (?!S) /x / (LIST|OF|LISTS) (?(?{ $+ eq "LIST" }) (?!) ) /x

    Makeshifts last the longest.

      Ok, I must have been suffering from a mental block on this one. I dont know how I didnt come up with these (and gaals).

      Just for the record im trying to come up with a way to make Perls regex engine optimize subpatterns of the type (LIST|OF|LITERAL|STRINGS), I had overlooked the implications of backtracking for this, and you've now set me straight. Thanks.

      ---
      demerphq

Re: Superfluous subpatterns in regex alternations.
by RMGir (Prior) on Jan 03, 2005 at 12:45 UTC
    Good point. As long as it's the only clause in the regex, you're right that they'd be equivalent.

    But to issue the warning you want, the regex engine would have to post-process the regex, to make sure it IS equivalent, and no one slipped in anything after the closing ).

    And even then, how would you deal with qr//?

    Given:

    $re=qr/(LIST|OF|LISTS)/;
    you couldn't optimize $re to /(LIST|OF)/, since you don't know where it will be used, right?

    Mike

      Yes, qr// does provide some problems here. But actually post processing the regex isnt a problem, as in principle thats exactly what I want to do. The objective is to turn patterns of the type (LIST|OF|LISTS) into a trie or DFA depending on the circumstances. If this is to be a silent optimisation then scanning the regex once its compiled will become necessary to identify points in the pattern where its possible. OTOH, introducing a new (?...:) type would reduce the costs involved, but would only make the optimisations available to new code. But this thread has given me lots of angles to consider. Thanks all.

      ---
      demerphq

Re: Superfluous options in regex alternations.
by holli (Abbot) on Jan 03, 2005 at 12:34 UTC
    my $t = "my LISTS are always long"; $t =~ /(LIST|OF|LISTS) /; print $1; #LISTS
    my $t = "my LISTS are always long"; $t =~ /((LIST|OF)S)/; print $1; #LISTS