PerlMonks  

The Dream of a New, Faster Regular Expression Engine

by monsieur_champs (Curate)
on Oct 05, 2004 at 14:02 UTC ( [id://396581]=perlmeditation )

Fellows,

A friend of mine pointed me to a paper published by the ACM that describes a new kind of regular expression engine which, as far as I understand, is able to match (using an approximate-matching technique) hundreds of regular expressions against one text in O(ln(x)) time (!!!!!). If this is true, I guess we will need to reimplement (or at least review) our regular expression engine.

I'm still learning about this and don't know a lot about approximate matching, but it seems to be a generalization of the exact pattern matching problem (which is what Perl's current regular expression engine solves).
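To make the idea concrete, here is a toy Perl sketch of approximate matching using Sellers' classic dynamic-programming algorithm (this is not the paper's technique, just an illustration): it reports whether a pattern occurs somewhere in a text with at most k edits (insertions, deletions, substitutions). With k = 0 it degenerates to exact substring matching, which is the sense in which approximate matching generalizes the exact problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approximate substring search (Sellers' dynamic-programming variant).
# Returns 1 if $pat occurs in $text with at most $k edits, 0 otherwise.
# Row 0 of the (column-wise) DP table is all zeros, so a match may
# start at any position in the text.
sub amatch {
    my ($pat, $text, $k) = @_;
    my @p = split //, $pat;
    my @t = split //, $text;
    my @col = (0 .. @p);    # distances against the empty text prefix
    for my $j (0 .. $#t) {
        my @new = (0);
        for my $i (1 .. @p) {
            my $sub = $col[$i - 1] + ($p[$i - 1] eq $t[$j] ? 0 : 1);
            my $del = $col[$i] + 1;
            my $ins = $new[$i - 1] + 1;
            my $min = $sub < $del ? $sub : $del;
            $min = $ins if $ins < $min;
            push @new, $min;
        }
        # bottom cell = best edit distance of $pat to some substring
        # of $text ending at position $j
        return 1 if $new[-1] <= $k;
        @col = @new;
    }
    return 0;    # no occurrence within $k edits
}
```

Note the O(length(pat) * length(text)) cost: that is exactly the kind of bound the paper claims to beat (for many patterns over a preprocessed text), which is what makes the claim interesting.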

I guess this kind of pattern matching is a by-product of the bioinformatics effort behind the Genome Project and other genetic sequencing projects around the world, and that the field is mature enough to justify a new regular expression engine implementation for Perl.

I also googled around and found a handful of papers on the subject of multiple string matching by Gonzalo Navarro, a researcher at the Computer Science Department of the University of Chile.

It seems to me that maybe it's time to start thinking about a new regular expression engine for Perl, perhaps as an add-on extension, so we don't break support for old applications. This could help Perl solve problems faster and more simply, allowing us to do more with even less code.

Replies are listed 'Best First'.
Re: The Dream of a New, Faster Regular Expression Engine
by diotalevi (Canon) on Oct 05, 2004 at 15:42 UTC

    You may want to take a look at Regexp::Approx - Use fuzzy regular expressions. It embeds String::Approx function calls into the standard regular expression engine.

    It would be valuable to have a proper approximate matching engine. So please, don't let this code lull you into thinking that the more general solution wouldn't be immensely preferred.

Re: The Dream of a New, Faster Regular Expression Engine
by hardburn (Abbot) on Oct 05, 2004 at 14:46 UTC

    If you want to build this into a module, you might want to read MJD's How Regexes Work. I hope you have some pennies around.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      I'm thinking seriously about this for my MSc degree, but it is still under evaluation. I'm discussing it with an old friend and undergraduate professor of mine.

      Maybe this will turn into a nice project.

Re: The Dream of a New, Faster Regular Expression Engine
by Anonymous Monk on Oct 05, 2004 at 14:51 UTC
    I've no idea what you mean by "to match hundreds of regular expressions against one text in O(lg(x))". What is x? It surely isn't the length of the text, or the length or number of the matches you want to perform, as you need at least linear time just to read in the text and the patterns. The abstract you link to doesn't mention this time either.

    From what I understand of the abstract, this is a highly specialized technique (and hence, probably not useful as a replacement for Perl regular expressions). Also, the abstract suggests fast searches will be able to be done after doing (a lot of) preprocessing work. That is, you are repeatedly searching the same text. This is not typical for Perl programs. (It's also what "study" was intended to be for).
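The preprocessing point can be made concrete with a toy Perl sketch (nothing like the paper's data structures): index the text once, then answer repeated queries without rescanning it. The pay-off only exists when the same text is searched many times, which, as noted above, is not the typical Perl workload.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy word -> starting-positions index. Building it costs O(length
# of text), paid once; afterwards each exact-word query is a hash
# lookup instead of a rescan of the whole string.
sub build_index {
    my ($text) = @_;
    my %index;
    while ($text =~ /(\w+)/g) {
        push @{ $index{ lc $1 } }, pos($text) - length($1);
    }
    return \%index;
}

my $text  = "The quick brown fox jumps over the lazy dog";
my $index = build_index($text);    # preprocessing, done once

my @the = @{ $index->{the} || [] };    # positions 0 and 31
my @fox = @{ $index->{fox} || [] };    # position 16
```

If the text changes between queries, the index must be rebuilt, and the preprocessing cost is paid again and again, which is the dynamic-text objection raised elsewhere in this thread.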

    Seems to me that maybe its time to start thinking about a new Regular Expression Engine for Perl, maybe as an add-on extension, so we don't need to break support to old applications.

    Are you volunteering? I think p5p badly needs another person to understand the regexp-engine. The current number of people who have a deep knowledge of the engine can be counted on the fingers of one hand. Patches welcome.

      In pieces:

      1. "x" is the length of the text (when matching one text against more than one regular expression). Read the full paper and you'll understand.
      2. This specific engine is still very specialized, but the technique behind it is more general than Perl's current implementation, and innovative. It deserves deeper study, and maybe even an implementation.
      3. I'm seriously thinking about volunteering. I believe such work could help me achieve an MSc degree (which is very desirable) while also contributing to the Perl community. I'm looking for someone with enough time and patience to guide me. Maybe you're interested.
      4. Sending a patch for the Perl source code is still on my computing dreams list. :-) I'll send a good one someday.

        Don't think about volunteering. Do it! I'll make my first request: make the regex engine re-entrant. It was supposed to be, but it's not. This is a serious limitation that admittedly doesn't hit many people, but when it does, it's a killer. I've toyed with the idea of using Inline::Ruby for an application because this is one of the things that Ruby's regex engine can do that Perl's cannot.

        We desperately need volunteers who have the skills to pull this off. I certainly don't, otherwise I would have done it :)

        Cheers,
        Ovid

        New address of my CGI Course.

        I'd like to read the full paper, but I didn't see a link to the full paper from the link you provided. Perhaps I missed it. Could you post a link to the full paper?

        As for doing the match in 'O(lg(X))', that surely is after some preprocessing - otherwise it would be impossible: you'd need O(X) time just to read in the text to match against. The abstract doesn't mention the preprocessing time, and for typical matches in a Perl program that time is significant, as most texts (strings) will be matched against only once.

      On the fingers of one hand? That's better than I thought then, I thought it was the fingers of one foot.
Re: A New, Faster Regular Expression Engine? the devil is in the details.
by wufnik (Friar) on Oct 11, 2004 at 23:38 UTC
    hola all

    i have checked the papers by dr Gonnet et al, and want to clear up a number of things re: the algorithms that monsieur champs rightly draws our attention to. i believe these points should attenuate somewhat the enthusiasm for the dream, idealist tho i am.

    1) the speedups pertaining to regular expressions achieved by Gonnet et al distinguish static from dynamic text. this is particularly important for us, as the speedups attained are attained because the text is 'preprocessed', or indexed, into a patricia trie, prior to regex matching.

    these speedups can be significant if the text you are searching is (like a genome) static. if you are dealing with dynamic text, i assume any speedups you might have gained are lost in the frequent reindexing that the methods of gonnet et al would require. gonnet's example is that of matching against the OED; ask yourself how many of your regexen are run against such a corpus. if the answer is 'most of them', then vote for the new dream by all means. if your answers are more varied, you may prefer to read... point 2:

    2) regardless of the time aspect of the problem in 1), the patricia tries used take up significant amounts of memory. i quote from Gonnet: 'for some applications, patricia tries still need too much space'. do you want to build this unpredictability into perl's robust, pretty fast, hand-tailored regex engine? i would rather not, even if someone else does all the hard work ;-). and also...

    3) the paper explicitly states that 'finding an algorithm with logarithmic search time for any RE query is still an open problem'. so was monsieur champs's friend wrong? undoubtedly not - for the special text (singular) he considered. the metrics to decide whether *your* text will also behave this way are... alas, not there, at the mo, to the best of my knowledge. the worst case for the algorithm based on searching a patricia trie (i distinguish this from searching the text) is linear. the algorithm performs less well in the presence of repeats - interesting, given the prevalence of repeats in genomic data. and <gasp>, lastly...

    4) i think it's worth distinguishing the above from approximate matching techniques, some of which are, to the best of my knowledge, *not* very fast (esp. the dynamic-programming-based ones, which give you the best results imho).
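    for what it's worth, the index-once/search-many idea behind points 1-3 can be sketched with a plain hash-of-hashes trie in perl. a patricia trie additionally path-compresses single-child chains (and the paper builds one over suffixes, not words), but the trade-off is the same: build cost and memory up front, lookups afterwards proportional to the query length rather than the text size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# plain (uncompressed) trie over a static word list; each node is a
# hash keyed by character, with a '-end' flag marking complete words.
sub trie_add {
    my ($trie, $word) = @_;
    my $node = $trie;
    $node = ($node->{$_} ||= {}) for split //, $word;
    $node->{-end} = 1;
}

sub trie_has {
    my ($trie, $word) = @_;
    my $node = $trie;
    for my $c (split //, $word) {
        $node = $node->{$c} or return 0;    # missing branch: no match
    }
    return $node->{-end} ? 1 : 0;
}

my $trie = {};
trie_add($trie, $_) for qw(gene genome genomic);
```

    note how the memory objection in point 2 shows up even here: every indexed character becomes a hash entry, so the index can dwarf the text it indexes.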

    nb: i think it's great that the subject above was brought up, it's interesting & thought provoking. i find myself compelled to be conservative with what i know & love, however. match that! ;-)

    ...wufnik

    -- in the world of the mules there are no rules --

Node Type: perlmeditation [id://396581]
Approved by herveus
Front-paged by diotalevi