PerlMonks  

The Dream of a New, Faster Regular Expression Engine

by monsieur_champs (Curate)
on Oct 05, 2004 at 14:02 UTC ( [id://396581]=perlmeditation )

Fellows,

A friend of mine pointed me to a paper published by the ACM that describes a new kind of regular expression engine which, as far as I understand, is able to match (using an approximate-matching technique) hundreds of regular expressions against one text in O(ln(x)) time (!!!!!). If this is true, I guess we will need to reimplement (or at least review) our regular expression engine.

I'm still learning about this and don't know a lot about approximate matching, but it seems to be a generalization of the exact pattern matching problem (which is what Perl's current regular expression engine solves).
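To make the idea concrete, here is a toy Perl sketch of approximate matching using Sellers' classic dynamic-programming algorithm (this is not the paper's technique, just an illustration): it reports whether a pattern occurs somewhere in a text with at most k edits (insertions, deletions, substitutions). With k = 0 it degenerates to exact substring matching, which is the sense in which approximate matching generalizes the exact problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Approximate substring search (Sellers' dynamic-programming variant).
# Returns 1 if $pat occurs in $text with at most $k edits, 0 otherwise.
# Row 0 of the (column-wise) DP table is all zeros, so a match may
# start at any position in the text.
sub amatch {
    my ($pat, $text, $k) = @_;
    my @p = split //, $pat;
    my @t = split //, $text;
    my @col = (0 .. @p);    # distances against the empty text prefix
    for my $j (0 .. $#t) {
        my @new = (0);
        for my $i (1 .. @p) {
            my $sub = $col[$i - 1] + ($p[$i - 1] eq $t[$j] ? 0 : 1);
            my $del = $col[$i] + 1;
            my $ins = $new[$i - 1] + 1;
            my $min = $sub < $del ? $sub : $del;
            $min = $ins if $ins < $min;
            push @new, $min;
        }
        # bottom cell = best edit distance of $pat to some substring
        # of $text ending at position $j
        return 1 if $new[-1] <= $k;
        @col = @new;
    }
    return 0;    # no occurrence within $k edits
}
```

Note the O(length(pat) * length(text)) cost: that is exactly the kind of bound the paper claims to beat (for many patterns over a preprocessed text), which is what makes the claim interesting.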

I guess this kind of pattern matching is a by-product of the bioinformatics effort behind the Genome Project and other genetic sequencing projects around the world, and that the field is mature enough to justify a new regular expression engine implementation for Perl.

I also googled around and found a handful of papers on the subject of multiple string matching by Gonzalo Navarro, a researcher at the Computer Science Department of the University of Chile.

It seems to me that maybe it's time to start thinking about a new regular expression engine for Perl, perhaps as an add-on extension, so we don't break support for old applications. This could help Perl solve problems faster and more simply, allowing us to do more with even less code.

Replies are listed 'Best First'.
Re: The Dream of a New, Faster Regular Expression Engine
by diotalevi (Canon) on Oct 05, 2004 at 15:42 UTC

    You may want to take a look at Regexp::Approx - Use fuzzy regular expressions. It embeds String::Approx function calls into the standard regular expression engine.

    It would be valuable to have a proper approximate matching engine. So please, don't let this code lull you into thinking that the more general solution wouldn't be immensely preferred.

Re: The Dream of a New, Faster Regular Expression Engine
by hardburn (Abbot) on Oct 05, 2004 at 14:46 UTC

    If you want to build this into a module, you might want to read MJD's How Regexes Work. I hope you have some pennies around.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      I'm thinking seriously about this for my MSc degree, but it is still under evaluation. I'm discussing it with an old friend and undergraduate professor of mine.

      Maybe this will turn into a nice project.

Re: The Dream of a New, Faster Regular Expression Engine
by Anonymous Monk on Oct 05, 2004 at 14:51 UTC
    I've no idea what you mean by "to match hundreds of regular expressions against one text in O(lg(x))". What is x? It surely isn't the length of the text, or the length or number of the matches you want to perform, as you need at least linear time just to read in the text and the patterns. The abstract you link to doesn't mention this time either.

    From what I understand of the abstract, this is a highly specialized technique (and hence, probably not useful as a replacement for Perl regular expressions). Also, the abstract suggests fast searches will be able to be done after doing (a lot of) preprocessing work. That is, you are repeatedly searching the same text. This is not typical for Perl programs. (It's also what "study" was intended to be for).
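The preprocessing point can be made concrete with a toy Perl sketch (nothing like the paper's data structures): index the text once, then answer repeated queries without rescanning it. The pay-off only exists when the same text is searched many times, which, as noted above, is not the typical Perl workload.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy word -> starting-positions index. Building it costs O(length
# of text), paid once; afterwards each exact-word query is a hash
# lookup instead of a rescan of the whole string.
sub build_index {
    my ($text) = @_;
    my %index;
    while ($text =~ /(\w+)/g) {
        push @{ $index{ lc $1 } }, pos($text) - length($1);
    }
    return \%index;
}

my $text  = "The quick brown fox jumps over the lazy dog";
my $index = build_index($text);    # preprocessing, done once

my @the = @{ $index->{the} || [] };    # positions 0 and 31
my @fox = @{ $index->{fox} || [] };    # position 16
```

If the text changes between queries, the index must be rebuilt, and the preprocessing cost is paid again and again, which is the dynamic-text objection raised elsewhere in this thread.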

    Seems to me that maybe its time to start thinking about a new Regular Expression Engine for Perl, maybe as an add-on extension, so we don't need to break support to old applications.

    Are you volunteering? I think p5p badly needs another person to understand the regexp-engine. The current number of people who have a deep knowledge of the engine can be counted on the fingers of one hand. Patches welcome.

      In pieces:

      1. "x" is the length of the text (when matching one text against more than one regular expression). Read the full paper and you'll understand.
      2. This specific engine is still very specialized, but the technique behind it is more general than Perl's current implementation, and innovative. It deserves deeper study, and maybe even an implementation.
      3. I'm seriously thinking about volunteering. I believe such work could help me achieve an MSc degree (which is very desirable) while also contributing to the Perl community. I'm looking for someone with enough time and patience to guide me. Maybe you're interested.
      4. Sending a patch for the Perl source code is still on my computing dreams list. :-) I'll send a good one someday.

        Don't think about volunteering. Do it! I'll make my first request: make the regex engine re-entrant. It was supposed to be, but it's not. This is a serious limitation that admittedly doesn't hit many people, but when it does, it's a killer. I've toyed with the idea of using Inline::Ruby for an application because this is one of the things that Ruby's regex engine can do that Perl's cannot.

        We desperately need volunteers who have the skills to pull this off. I certainly don't, otherwise I would have done it :)

        Cheers,
        Ovid

        New address of my CGI Course.

        I'd like to read the full paper, but I didn't see a link to the full paper from the link you provided. Perhaps I missed it. Could you post a link to the full paper?

        As for doing the match in 'O(lg(X))', that surely is after some preprocessing - otherwise it would be impossible: you'd need O(X) time just to read in the text to match against. The abstract doesn't mention the preprocessing time, and for typical matches in a Perl program that time is significant, as most texts (strings) will be matched against only once.

      On the fingers of one hand? That's better than I thought then, I thought it was the fingers of one foot.
Re: A New, Faster Regular Expression Engine? the devil is in the details.
by wufnik (Friar) on Oct 11, 2004 at 23:38 UTC
    hola all

    i have checked the papers by dr Gonnet et al, and want to clear up a number of things re: the algorithms that monsieur champs rightly draws our attention to. i believe these points should attenuate somewhat the enthusiasm for the dream, idealist tho i am.

    1) the speedups pertaining to regular expressions achieved by Gonnet et al distinguish static from dynamic text. this is particularly important for us, as the speedups attained are attained because the text is 'preprocessed', or indexed, into a patricia trie, prior to regex matching.

    these speedups can be significant if the text you are searching is (like a genome) static. if you are dealing with dynamic text, i assume any speedups you might have gained are lost in the frequent reindexing that the methods of gonnet et al would require. gonnet's example is that of matching against the OED; ask yourself how many of your regexen are run against such a corpus. if the answer is 'most of them', then vote for the new dream by all means. if your answers are more varied, you may prefer to read... point 2:

    2) regardless of the time aspect of the problem in 1), the patricia tries used take up significant amounts of memory. i quote from Gonnet: 'for some applications, patricia tries still need too much space'. do you want to build this unpredictability into perl's robust, pretty fast, hand-tailored regex engine? i would rather not, even if someone else does all the hard work ;-). and also...

    3) the paper explicitly states that 'finding an algorithm with logarithmic search time for any RE query is still an open problem'. so was monsieur champs's friend wrong? undoubtedly not - for the special text (singular) he considered. the metrics to decide whether *your* text will also behave this way are... alas, not there, at the mo, to the best of my knowledge. the worst case for the algorithm based on searching a patricia trie (i distinguish this from searching the text) is linear. the algorithm performs less well in the presence of repeats - interesting, given the prevalence of repeats in genomic data. and <gasp>, lastly...

    4) i think it's worth distinguishing the above from approximate matching techniques, some of which are, to the best of my knowledge, *not* very fast (esp. the dynamic-programming-based ones, which give you the best results imho).
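    for what it's worth, the index-once/search-many idea behind points 1-3 can be sketched with a plain hash-of-hashes trie in perl. a patricia trie additionally path-compresses single-child chains (and the paper builds one over suffixes, not words), but the trade-off is the same: build cost and memory up front, lookups afterwards proportional to the query length rather than the text size.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# plain (uncompressed) trie over a static word list; each node is a
# hash keyed by character, with a '-end' flag marking complete words.
sub trie_add {
    my ($trie, $word) = @_;
    my $node = $trie;
    $node = ($node->{$_} ||= {}) for split //, $word;
    $node->{-end} = 1;
}

sub trie_has {
    my ($trie, $word) = @_;
    my $node = $trie;
    for my $c (split //, $word) {
        $node = $node->{$c} or return 0;    # missing branch: no match
    }
    return $node->{-end} ? 1 : 0;
}

my $trie = {};
trie_add($trie, $_) for qw(gene genome genomic);
```

    note how the memory objection in point 2 shows up even here: every indexed character becomes a hash entry, so the index can dwarf the text it indexes.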

    nb: i think it's great that the subject above was brought up, it's interesting & thought provoking. i find myself compelled to be conservative with what i know & love, however. match that! ;-)

    ...wufnik

    -- in the world of the mules there are no rules --

Node Type: perlmeditation [id://396581]
Approved by herveus
Front-paged by diotalevi