ExReg has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to extend a tool that looks at files, searching for certain patterns. The tool is in perl and makes copious use of regular expressions. I came across a new pattern that has me scratching my head. I am trying to match a certain expression e that can have more than one definition ( shown in much simplified form ):

1. e =~ u 2. e =~ uae 3. e =~ edefe 4. e =~ ebe 5. e =~ ec

As you can see, all but the first are recursive matches. I could write the above as

e =~ u|uae|edefe|ebe|ec

Here is what I have so far: if I had only 1 and 2 from above, I could write the combination of 1 and 2 as

e =~ u(?:au)*

Likewise, for 1 and 3, I could write

e =~ udufu

For 1 and 4, I would have

e =~ ubu

and for 1 and 5, I would have

e =~ uc

To go further, I would have to combine more than two together, and I think it could get horrifyingly complex. Is there a way to tackle a problem like this with dynamic regexes, or is there a method for substituting the above together?

I have perl 5.6, so I am slightly limited on what I can do. I do not know dynamic regexes well enough to frame the above with those. I cannot use CPAN or anything other than the screwdriver I have.

Replies are listed 'Best First'.
Re: When the only tool you have is a screwdriver...
by tangent (Parson) on Oct 08, 2015 at 17:09 UTC
    Last night I was doing a bit of light reading which just might be of help to you. In the book Mastering Algorithms with Perl, Chapter 9: Strings, p.399, there is an example of developing a top-down parser to translate a query into a regular expression. For example, it will take a query like:
    abc and not (def or ghi)
    and turn it into:
    /abc/ && ! ( /def/ || /ghi/ )
    The example is very thorough, and the finished code is very simple, so if you can get hold of the book I am sure you could adapt it to your needs.
      I have the book right above me. And it is the one that I had not looked in. Fantastic! I think I will have to bury myself in it for a while and see what inspiration I can get.
Re: When the only tool you have is a screwdriver...
by Anonymous Monk on Oct 08, 2015 at 17:44 UTC

    Do you have a good set of test cases ?

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1144184 use strict; use warnings; sub e { /\Guc*/gc or return 0; # rule 1. & 5. /\Ga/gc and return e(); # rule 2. /\Gc+/gc; # rule 5. /\Gd/gc and return e() && /\Gf/gc && e(); # rule 3. /\Gb/gc and return e(); # rule 4. return 1; } for (qw( u udufu uau uu uc ubu ucbuc uaau ua uauau uccc )) { print e() && /\G\z/ ? "matches" : " fail", " $_\n"; }

    outputs:

    matches u matches udufu matches uau fail uu matches uc matches ubu matches ucbuc fail uaau fail ua matches uauau matches uccc

    Which of these is wrong, if any? Let's keep tweaking until it works :)

      Cleaning up (and making more clear) the parsing.

      #!/usr/bin/perl # http://perlmonks.org/?node_id=1144184 use strict; use warnings; sub e { /\Gua/gc ? return e() : /\Gu/gc || return 0; 1 while (/\Gd/gc && e() && /\Gf/gc or /\Gb/gc) && e() or /\Gc/gc; return 1; } for (qw( u udufu uau uu uc ubu ucbuc uaau ua uauau uccc uauuau )) { print e() && /\G\z/ ? "matches" : " fail", " $_\n"; }

      Again, do you have a good set of test cases ?

        Adding random test cases found a weird frankenbug. Fixing backup seems to have fixed it. Do you have any good test cases ?

        #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1144184 use strict; use warnings; sub e { my $p; /\Gua/gc ? return e() : /\Gu/gc || return 0; 1 while /\Gc/gc or ($p = pos(), /\Gb/gc && e() || (pos() = $p, 0)) or ($p = pos(), /\Gd/gc && e() && /\Gf/gc && e() || (pos() = $p, 0)); return 1; } for (qw( u udufu uau uu uc ubu ucbuc uaau ua uauau uccc uauuau udufuaua uaua uauau uccbucdc )) { print e() && /\G\z/gc ? "matches" : " fail", " $_"; }
Re: When the only tool you have is a screwdriver...
by oiskuu (Hermit) on Oct 09, 2015 at 14:52 UTC

    e =~ u|uae|edefe|ebe|ec
    It should be doable in more up-to-date perl. The expression e must start with token u, so let's rewrite it using e => uE?
    e =~ u|uauE?|uE?defe|uE?be|uE?c
    E =~ auE?|E?defe|E?be|E?c
    It should be evident that E? simplifies as
    E =~ (au)*(defe|be|c)*

    Putting things together:

    #! /usr/bin/perl my $g3 = "( d(?1)f(?1) | b(?1) | c )"; # (defe|be|c) my $g2 = "( (?:au)* (?3)* )"; # E? my $g1 = "( u | uau(?2) | u(?2)(?3) )"; # e my $re = qr/ $g1 (?(DEFINE) $g2 $g3 ) /x; warn "$re\n"; /^$re$/ and print while <>;

    But of course, this won't work with perl5.6 ...

    Update. It looks this simplifies to just

    my $re = qr/(u(?:au)*(?:d(?-1)f(?-1)|b(?-1)|c)*)/;

Re: When the only tool you have is a screwdriver...
by locked_user sundialsvc4 (Abbot) on Oct 08, 2015 at 15:49 UTC

    It seems to me that you are going to have to approach this problem with algorithmic logic that goes beyond the simple use of regular expressions.   You are actually parsing this string, and therefore I would consider using a parse-engine ... say Parse::RecDescent a Perl front-end wrapper for a “real, and advanced,” non-recursive parser.   (This problem does not permit a recursive-descent solution ... see below.)

    From a grammar point-of-view, your input strings consist of “identifiers,” being fixed literal strings such as u, a, d, b, c, and productions, such as e.   Thus, a grammar for this .. abstractly speaking .. might go something like this:

    e ::= 'u' | 'u' 'a' <e> | <e> 'd' <e> 'f' <e> | <e> 'b' <e> | <e> 'c'
    ... with the fact that the last three rules are “head recursive being a very-serious problem that would immediately exclude the use of many parsers.   (If a production begins with an instance of itself, the parser falls down a rabbit-hole and never comes out, if it attempts a “depth-first” approach as many parsers do.   Some parsers such as Bison provide convoluted means for parsing head-recursive grammars but they are difficult compromises.

    It also seems to me that this definition, however expressed, will have a lot of potential for ambiguity:   there could be many valid parses of the same string.

    Certainly, the use of a parser algorithm is called-for here, of some sort.   Quietly give-up on the notion of trying to do this with regex calisthenics.   And, also, give up on the notion that you will be able to efficiently solve this with “only a screwdriver.”

        Yes, but your situation is not necessarily “balanced.”   In the left/right parenthesis example, once the right-paren is seen, the matter is resolved.   But in some of your examples, we see multiple occurrences of "e".   Thus, in the general case, more-than-one recursion might have to be entered, and then backed-out in what turns out to be a failed parse-branch.   I do not believe that you can expect the regex engine to be that smart, although other Monks may promptly jump-in and show that I am wrong.

        However ... the saving grace might be “your actual data.”   If, for example, the terminal-tokens always consist of single characters, some sort of one- or one-and-three- character lookahead might be sufficient to resolve ambiguity.   After all, you do not always have to write for the general case:   you merely have to write what gets the job done in your case.   (Yay!)

        Even so, I think that this requirement (IMHO) has stepped beyond what you can reasonably expect from regular-expressions alone:   it is, I think, “a parsing problem.”   (By that I mean, “some sort of heuristic will be required, on top of what regexes alone can give you.)   With good fortune, you will find that there is an existing implementation of a heuristic (a parser) that will do the job.   Without good fortune, you will hack.   ;-)

      You are right. It really is a parsing problem. I have been looking forlornly at things like RecDescent, wishing I had it on the system in question. I will probably flog this dead horse a bit longer before I have to tell them the that handle of the screwdriver doesn't pound to well and that I need a hammer also.

        No need for a hammer. See "sub e" in 1144217 . It's only three lines long :)

        The first line looks for what starts an "e".
        The second line looks for what can follow an "e" and still be an "e".
        The third line indicates success :)

        What could be simpler ? (hehehe)