Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change

Reverse engineering regular expressions

by paulski (Beadle)
on Aug 01, 2005 at 00:53 UTC ( #479762=perlquestion: print w/replies, xml ) Need Help??

paulski has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I've got a problem I'm hoping someone's already solved or someone can point me in the direction of some possible solutions.

I need to write some PERL code that will reverse engineer a PERL regexp. That is, given a regular expression:



The code will generate a string that will match this regular expression.

e.g. code might generate the string.

"This is a test."

Has anyone looked at this type of thing before?



  • Comment on Reverse engineering regular expressions

Replies are listed 'Best First'.
Re: Reverse engineering regular expressions
by Joost (Canon) on Aug 01, 2005 at 01:15 UTC
    Interesting problem, but I wonder why you would want to solve it.

    If you "just" want something to help you explain a regex, Regex Coach is helpful.

    If you want to support the perl code-extensions - (?{ ... }) and friends - you can't. Update: after some more reading, lookaheads are tricky too, you're probably restricted to "classical" regular expressions + a few extensions.

    Generating a "minimal" string that would match a regex (i.e. drop everything with a * modifier, drop everything after a |, take the first character of a characterset etc) should not be too hard (famous last words...) as long as you get the parser right. I suggest you take an existing one like YAPE::Regex or Regexp::Parser.

    Oh, and while I was looking for those modules on cpan, I also found Regexp::Genex - it seems to do what you want. :-)

Re: Reverse engineering regular expressions
by Enlil (Parson) on Aug 01, 2005 at 01:04 UTC
    As least one person has.

    ..and the question does come up here everyonce in a while as well.


Re: Reverse engineering regular expressions
by planetscape (Chancellor) on Aug 01, 2005 at 07:55 UTC

    Perhaps what you really want is something on the order of YAPE::Regex::Explain.

    The regular expression: (?-imsx:(.*)test(.*)) matches as follows: NODE EXPLANATION ---------------------------------------------------------------------- (?-imsx: group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): ---------------------------------------------------------------------- ( group and capture to \1: ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- test 'test' ---------------------------------------------------------------------- ( group and capture to \2: ---------------------------------------------------------------------- .* any character except \n (0 or more times (matching the most amount possible)) ---------------------------------------------------------------------- ) end of \2 ---------------------------------------------------------------------- ) end of grouping ----------------------------------------------------------------------
Re: Reverse engineering regular expressions
by GrandFather (Saint) on Aug 01, 2005 at 01:03 UTC

    For simple regex's that is easy. For some it is impossible. For example consider:

    /$interpolated/ /(?(?{rand})this|that)/

    Perl is Huffman encoded by design.
Re: Reverse engineering regular expressions
by davido (Cardinal) on Aug 01, 2005 at 04:28 UTC

    Consider the following regular expression:


    In a purely ASCII environment (without UTF or Unicode character sets) this will match exactly 255 different strings (it would be 256 but \n won't match). For example, it will match:

    • testa
    • testb
    • testc
    • test1
    • test2
    • test$
    • test%
    • .......and so on....

    But now look at your regular expression. With no other criteria beyond m/(.*)test(.*)/ you have a hopelessly large number of possibilities. Minimally, it could match "test". But it could also match any string of any size (up to the capacity of your computer's memory and swapfile) as long as the sequence 'test' is found somewhere within the string. In fact, given a random string of random length, the likelyhood of finding a match increases as the string grows, for in an infinite sequence of random characters there will exist an infinate number of embeded 'test' sequences.

    The point is, there's no point to trying to generate every (or any) string that will match a given regular expression if that RE doesn't somehow limit the number of possibilities to be a manageable quantity.


Re: Reverse engineering regular expressions
by polypompholyx (Chaplain) on Aug 01, 2005 at 09:52 UTC

    Dominus's Higher Order Perl has a chapter on generating strings from simple regex-like specifications, but the online version isn't quite online yet.

    And, since everyone is thinking it, but no-one has said it yet: perl is a binary, Perl is a language, PERL is a typo...

Re: Reverse engineering regular expressions
by bsb (Priest) on Aug 01, 2005 at 12:47 UTC
    Generating regex strings with a regex

    Or with Rexexp::Genex:

    perl -MRegexp::Genex=:all -le 'print for strings(qr/(.*)test(.*)/)' 3'e3''test 3'e3'test~ 3'e3'test 3'e3test ╠ 3'e3test  3'e3test 3'etest 3'etest 3'etest 3'etest 3'test╠e  3'test╠e 3'test╠e 3'test╠ 3'test 3test~ 3test test

      But what has that gotten you? At best, a contrived minimal slice of an infinite set. Where, for example, is "asdf;laksjfd;alksjaewraefsdtest134qwefalskdfjaeraf;" (another minimal slice of infinity).


        I didn't have time to wait for that one. :)

        You are right, of course. However, sometimes a slice of an infinite set is enough, say for testing or debugging. Also, regexes with only small characater classes and finite quantifiers do produce useful and complete results.

        Dominus' book gives you an infinite stream of the matching strings, ordered by length. Still quite a wait in your case.

Re: Reverse engineering regular expressions
by onegative (Scribe) on Aug 01, 2005 at 17:48 UTC
    I believe many have touched on the fact that infinite possibilities exist for any given situation unless you already had some notion of what you expected in the first place. I use the following tool to help with tricky regex expressions, not because I can't do them myself, but rather to help speed up the process.
      I found a better module on CPAN that does exactly what I need Thanks for all the input guys.

      The problem I'm having now is how to get a regexp into this function. I wrote a stub progam to test.

      #!/usr/bin/perl -w
      use strict;
      use Parse::RandGen::Regexp;
      my $regexp = "/^STOR\s^\n{100}/smi";
      my $r = Parse::RandGen::Regexp->new($regexp);
      my $string = $r->pick(match=>1, captures=>{}); 
      print("\$string: $string\n");
      This throws the following error.
      Unrecognized escape \s passed through at ./ line 6.
      %Error:  Parse::RandGen::Regexp has an element that is not a Regexp reference (ref="")! at /usr/lib/perl5/site_perl/5.8.6/Parse/RandGen/ line 36
      	Parse::RandGen::Regexp::_newDerived('Parse::RandGen::Regexp=HASH(0x9dcdd88)', 'HASH(0x9e5f138)') called at /usr/lib/perl5/site_perl/5.8.6/Parse/RandGen/ line 81
      	Parse::RandGen::Condition::new('Parse::RandGen::Regexp', '/^STORs^\x{a}{100}/smi') called at ./ line 7
      Now I need the string in regexp format to pass to the function. I could just put the string in qr//s but in my real program I need to read the regexps from a list so they will come in scalar format.

      i.e. How do I convert:




      I'm not sure how to do this conversion.



        This is a duplicate of my answer to Parse::RandGen::Regexp

        As the qr// method runs in interpolative context one can do

        use strict; my $match='test'; my $regex_match=qr/$match/i; my $test_value='This is a Test'; print 'It matches' if $test_value=~m/$regex_match/;


        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://479762]
Approved by GrandFather
Front-paged by johnnywang
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2022-11-29 00:19 GMT
Find Nodes?
    Voting Booth?