Kafka has asked for the wisdom of the Perl Monks concerning the following question:

Hi Dear monks,

I am writing a script that will get inputs from the user on regex format.
I would like to "lint" the input so if a new regex given is already covered by another it will issue a warning.
For example if in the list there are the following items:

abc.*f hello world\d 1234
And the user will add
abcd.*f
The program will issue a warning saying:
'abcd.*f' is already covered by 'abc.*f'.
Also if the user will enter:
^hello world\d
It will issue a warning

How can I do this?

Replies are listed 'Best First'.
Re: Is regex 1 covered by regex 2
by choroba (Cardinal) on Apr 29, 2015 at 10:14 UTC
    You might benefit from reading chapter 6.5 Regex String Generation from the "Higher Order Perl" by Mark Jason Dominus (available online at http://hop.perl.plover.com/book/).

    Update:Fixed the author.

    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Is regex 1 covered by regex 2
by hdb (Monsignor) on Apr 29, 2015 at 10:58 UTC

    There is probably no simple solution as the regexes can look very different but mean (nearly the same), like /\d/ and /[0-9]/. You could check for the similarity of the regexes when seeing them as strings:

    use strict; use warnings; use Text::Levenshtein 'distance'; my @old = ( 'abc.*f', 'hello world\d', '1234' ); my $new = 'abcd.*f'; for (@old) { print "Warning: $new looks like $_\n" if distance( $new, $_ ) < 6; }

    There are a lot of possible measures of distance around.

      Your observation that /\d/ and /[0-9]/ are (edit-insert: s/equivalent/loosely equivalent) loosely equivalent in regexen

      (see Jenda's Re^3: Is regex 1 covered by regex 2;
      note also that "equivalent" was my word choice; hdb's phrasing was clearly better)

      is precisely why "similarity... as strings" (as judged by Text::Levenshtein or any package based on a minimally variant definition) is irrelevant in light of OP's specific objective of identifying equivalent regexen ("if a new regex given is already covered by another".

      From perldoc Text::Levenshtein:

      This module implements the Levenshtein edit distance, which measures the difference between two strings, in terms of the *edit distance*. This distance is the number of substitutions, deletions or insertions ("edits") needed to transform one string into the other one (and vice versa)....

      I suspect (emphasis "suspect"; but lack the time just now to test the suspicion) that an approach with some chance of success in OP's terms would involve using the regex engine itself (but NOT by testing variants against identical data... where there would be far too many un-covered possibilities; edge cases and other instances where having two regexen match a particular text would lack rigor.

      A study of the code used in various regex testers or tutors might be profitable.

        The two, \d and [0-9], are not equivalent unless you use the /a modifier. By someone's IMNSHO incorrect decision, \d was implemented to mean "anything that might be considered a digit in languages you will never hear of", not "number understood by Perl and usable in computation". While the likelihood that you end up with a string containing any such characters is pretty slim, you should play safe and either use [0-9] or the /a modifier!

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

        I fully support your comments. It is really up to Kafka to see whether such a simple approach would be sufficient. This probably also depends on the sophistication he can expect from his users (the more sophisticated they are the more likely this approach will not work). Whether or not testing against data is useful, would also depend on the context. If the data is from a limited domain, it could work. My suspicion though is, that he wants to do the checks to avoid unnecessary matching against data...

Re: Is regex 1 covered by regex 2
by Anonymous Monk on Apr 29, 2015 at 10:16 UTC
      davido's precious site to test regular expressions is now here

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Is regex 1 covered by regex 2
by jeffa (Bishop) on Apr 29, 2015 at 16:10 UTC

    "... if a new regex given is already covered by another ..."

    You left this requirement open to ambiguous interpretation. Do you want an exact match or do you want fuzzy matching? For example: abc.*f will match all the things that abc\wf will match, but not vice versa. So if abc.*f is already issued and the user tries to issue abc\wf, should your program issue a warning?

    In the off-change that it should not ... then all you need to is store each submitted value from user input into a hash as a key and increment the value for that key each time the key is seen:

    my %seen; while (chomp(my $input = <STDIN>)) { exit unless $input; $seen{$input}++; warn "you already issued $input\n" if $seen{$input} > 1; print "issued: ", join(', ', sort keys %seen ) ,$/; }

    If this is correct, then the fact that you are storing regular expressions is misleading -- you really just want to store strings that will be used as regular expressions later. Please clarify. :)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Hi
      In the case you gave abc.*f covers all the strings that abc\wf does, so a warning should be issued.
Re: Is regex 1 covered by regex 2
by Laurent_R (Canon) on Apr 29, 2015 at 18:06 UTC
    Hum, a very limited try at using the regex engine to simplify the string entered by the user and see if it can be matched by an existing regex. The idea is to remove any occurrence of .*, to replace (.)+ by a single occurrence of the letter matched by (.), etc.

    This is very brief attempt under the debugger:

    $ perl -dE 42 Loading DB routines from perl5db.pl version 1.33 Editor support available. Enter h or `h h' for help, or `man perldebug' for more help. main::(-e:1): 42 DB<1> $regex = qr/abc.*f/; DB<2> $new = "abcd.*f"; DB<3> $new =~ s/.\*//g; DB<4> $new =~s/(.)\+/$1/g; DB<5> p $new; abcdf DB<6> print "true" if $new =~ /$regex/; true DB<7>
    I do not know whether this approach can lead to something interesting, but it would be very hard to make it error proof.

    Update 18:13 UTC: fixed a typo in line DB<4>.

    Je suis Charlie.
Re: Is regex 1 covered by regex 2
by MidLifeXis (Monsignor) on Apr 29, 2015 at 12:29 UTC

    ++ and Front-paged not so much for this particular question, but for the discussion on the general concept: identifying sub-grammars within a larger grammar. A topic like this tickles my CS fancy, and I am looking forward to the discussion it could generate.

    --MidLifeXis