Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have been searching for an answer to the following problem: Given a regexp and the maximal length of the string results know all possible outputs. I am concretely thingking about a method call that would expand into an array of possible strings of charaters. It could be mathematically demostrated that given the possible maximal length of the result strings, there is a finite number of them. Intuition could be squarely wrong, but I have the very strong feeling that there should be something like that. Well, maybe I am getting to these issues from another perspective and I am wishfully thinking (and lazzyly wanting :-), for there to be something like that. Again, I could be basically wrong in my approach.
  • Comment on expanding regexps, question with no answer yet

Replies are listed 'Best First'.
Re: expanding regexps, question with no answer yet
by japhy (Canon) on Aug 17, 2001 at 19:01 UTC
    Mark-Jason Dominus wrote a regex-to-text-stream program in ML, and I wrote one in Perl. The process is not terribly difficult, but it can become difficult if allow things like look-ahead and look-behind and deferred-evaluation assertions.

    Basically, you must parse a regex into its parts. Then work along these parts, one by one, building a string.

    /^\w+\s+(\d{3}|\w+)/ \w+ => "j30_98a3" \s+ => "\n\n\t \r" \d{3} => "523" (or) \w+ => "j30_98a3" \s+ => "\n\n\t \r" \w+ => "r4_QK4"
    You basically produce a string by processing the NODE and its QUANTIFIER. I might produce a module to do this in the upcoming rewrite of my regex parser, to appear in the Regexp::Parser hierarchy. Or perhaps Regex::Parser. I don't know.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      I have an interest in this as well. I also wrote a set of (experimental) modules that do just what you mention here.

      If you're interested I can let you have a copy, and maybe you can replace my rather crude regex parser with something more elegant.

      Yves / DeMerphq
      --
      When to use Prototypes?

Re: expanding regexps, question with no answer yet
by trantor (Chaplain) on Aug 17, 2001 at 18:32 UTC

    Poor man's answer:

    it is possible indeed, and the demonstration is quite simple:
    1. Given the maximum length of the string, generate every possible string and match against the regexp. It's obviously a finite number, possibly huge but finite.
    2. If the string matches, push it into an array which is initially empty. A number less than a finite number (or equal) is certainly finite.

    Now this is purely theoretical and ridiculously inefficient, but it can be done quite easily as a proof of concept if you like.

    Be careful when you generate the possible strings (if you're really going to), Unicode is out there!

    Another much more reasonable approach would be studying the Perl code that generates the Finite State Machine used to match a certain regexp and use that as a starting point for generating strings instead of matching them. It wouldn't probably be too bad if backtracking and other fancy features weren't there.

    -- TMTOWTDI

Re: expanding regexps, question with no answer yet
by Hofmator (Curate) on Aug 17, 2001 at 18:49 UTC

    I don't know of any module which does that - for a very simple reason, the number of possibilites quickly explode to very large numbers.

    Let's have a closer look - disregarding any Unicode issues, that would render things immediately impossible - at e.g. this regex /a/. So we are looking for a string that contains the letter 'a'. Let the maxlen be 3, a very short string.

    We start out with all possible three letter strings. Every byte of the string can have (in principle) 256 different values. This gives 256**3 = 16,777,216 different strings with 3 bytes each. In turn this eats up at least (in reality it's more) 50,331,648 bytes of memory, that's nearly 50 MB! For a three letter string. Furthermore the amount of strings you will get back is huge as well - and what would you do with them ...

    Ok, the problem itself is still interesting but you have to turn it down a bit - then it makes a very nice programming problem:

    • take a very small source alphabet, so instead of allowing all 256 values for each character, restrict yourself to numbers from 0..9 or capital letters 'A'..'Z'
    • choose only very small maxlength for the string

    So go ahead and try to write the little program - and feel free to ask here if you come across some problems.

    -- Hofmator

      Thanks to magick autoincrement, there's no need to burn ram, at least on the candidate set. Of course it'll still take forver and two years on strings of any particularly interesting length.

      'The fickle fascination of and Everlasting God' - Billy Corgan, The Smashing Pumpkins

        Well, the magic autoincrement won't help too much, as it doesn't iterate over the whole range of 256 characters. Still, you are correct, you don't have to create all the possible strings in memory. Nevertheless with a regex like /./ you get to keep all possibilites so the memory problem remains. Not to mention the time issue ...

        -- Hofmator