Re: expanding regexps, question with no answer yet
by japhy (Canon) on Aug 17, 2001 at 19:01 UTC
|
Mark-Jason Dominus wrote a regex-to-text-stream program in ML, and I wrote one in Perl. The process is not terribly difficult, but it can become difficult if allow things like look-ahead and look-behind and deferred-evaluation assertions.
Basically, you must parse a regex into its parts. Then work along these parts, one by one, building a string.
/^\w+\s+(\d{3}|\w+)/
\w+ => "j30_98a3"
\s+ => "\n\n\t \r"
\d{3} => "523"
(or)
\w+ => "j30_98a3"
\s+ => "\n\n\t \r"
\w+ => "r4_QK4"
You basically produce a string by processing the NODE and its QUANTIFIER. I might produce a module to do this in the upcoming rewrite of my regex parser, to appear in the Regexp::Parser hierarchy. Or perhaps Regex::Parser. I don't know.
_____________________________________________________
Jeff[japhy]Pinyan:
Perl,
regex,
and perl
hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??; | [reply] [d/l] |
|
|
I have an interest in this as well. I also wrote a set of (experimental) modules that do just what you mention here.
If you're interested I can let you have a copy, and maybe you can replace my rather crude regex parser with something more elegant.
Yves / DeMerphq
--
When to use Prototypes?
| [reply] |
Re: expanding regexps, question with no answer yet
by trantor (Chaplain) on Aug 17, 2001 at 18:32 UTC
|
Poor man's answer:
it is possible indeed, and the demonstration is quite
simple:
- Given the maximum length of the string, generate
every possible string and match against the regexp. It's obviously
a finite number, possibly huge but finite.
- If the string matches, push it into an array which is initially
empty. A number less than a finite number (or equal) is certainly finite.
Now this is purely theoretical and ridiculously inefficient, but it can be done quite easily as a proof of concept if you like.
Be careful when you generate the possible strings (if you're really going to), Unicode is out there!
Another much more reasonable approach would be studying the Perl code that
generates the Finite State Machine used to match a certain
regexp and use that as a starting point for generating
strings instead of matching them. It wouldn't probably
be too bad if backtracking and other fancy features weren't there.
-- TMTOWTDI
| [reply] |
Re: expanding regexps, question with no answer yet
by Hofmator (Curate) on Aug 17, 2001 at 18:49 UTC
|
I don't know of any module which does that - for a very simple reason, the number of possibilites quickly explode to very large numbers.
Let's have a closer look - disregarding any Unicode issues, that would render things immediately impossible - at e.g. this regex /a/. So we are looking for a string that contains the letter 'a'. Let the maxlen be 3, a very short string.
We start out with all possible three letter strings.
Every byte of the string can have (in principle) 256 different values. This gives 256**3 = 16,777,216 different strings with 3 bytes each. In turn this eats up at least (in reality it's more) 50,331,648 bytes of memory, that's nearly 50 MB! For a three letter string. Furthermore the amount of strings you will get back is huge as well - and what would you do with them ...
Ok, the problem itself is still interesting but you have to turn it down a bit - then it makes a very nice programming problem:
- take a very small source alphabet, so instead of allowing all 256 values for each character, restrict yourself to numbers from 0..9 or capital letters 'A'..'Z'
- choose only very small maxlength for the string
So go ahead and try to write the little program - and feel free to ask here if you come across some problems.
-- Hofmator
| [reply] [d/l] |
|
|
Thanks to magick autoincrement, there's no need to burn ram, at least on the candidate set. Of course it'll still take forver and two years on strings of any particularly interesting length.
'The fickle fascination of and Everlasting God'
- Billy Corgan, The Smashing Pumpkins
| [reply] |
|
|
Well, the magic autoincrement won't help too much, as it doesn't iterate over the whole range of 256 characters. Still, you are correct, you don't have to create all the possible strings in memory. Nevertheless with a regex like /./ you get to keep all possibilites so the memory problem remains. Not to mention the time issue ...
-- Hofmator
| [reply] [d/l] |
|
|