Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re^2: The Dream of a New, Faster Regular Expression Engine

by monsieur_champs (Curate)
on Oct 05, 2004 at 15:39 UTC ( [id://396624]=note: print w/replies, xml ) Need Help??


in reply to Re: The Dream of a New, Faster Regular Expression Engine
in thread The Dream of a New, Faster Regular Expression Engine

In pieces:

  1. "X" is the lenght of the text (when matching one text against more than one regular expression). Read the full paper and you'll understand.
  2. This specific engine still very specialized. But the technique used on it is more general than the Perl's current implementation and innovative, and deserves a deeper study, and maybe even a implementation.
  3. I'm thinking seriously on volunteering. I believe that such a work could help me achieve a MSc degree (and this is very desirable) while still contributing to the Perl Community. I'm looking for someone with time and patience enough to guide me. Maybe you're interested.
  4. Send a patch to the Perl source code stills on my computer dreams list. :-) I'll send a good one someday.
  • Comment on Re^2: The Dream of a New, Faster Regular Expression Engine

Replies are listed 'Best First'.
Re^3: The Dream of a New, Faster Regular Expression Engine
by Ovid (Cardinal) on Oct 05, 2004 at 17:41 UTC

    Don't think about volunteering. Do it! I'll make my first request: make the regex engine re-entrant. It was supposed to be, but it's not. This is a serious limitation that admittedly doesn't hit many people, but when it does, it's a killer. I've toyed with the idea of using Inline::Ruby for an application because this is one of the things that Ruby's regex engine can do that Perl's cannot.

    We desperately need volunteers who have the skills to pull this off. I certainly don't, otherwise I would have done it :)

    Cheers,
    Ovid

    New address of my CGI Course.

      Dear ovid
      I'm just a poor mortal that thinks big. And this doesn't qualify me (at least not in a near future) with the skills needed to make this change. I'll need a powerful Perl Master that is interested on this problem, and a big chunk of time to study and develop this.

      I have the interest, but lack the skills and/or a master to guide me in this journey.

      I'm really sorry about this.

Re^3: The Dream of a New, Faster Regular Expression Engine
by Anonymous Monk on Oct 06, 2004 at 10:52 UTC
    I'd like to read the full paper, but I didn't see a link to the full paper from the link you provided. Perhaps I missed it. Could you post a link to the full paper?

    As for doing the match in 'O(lg(X))', that surely is after some preprocessing - as otherwise it would be impossible, you'd need O(X) time to read in the text to match against. The abstract doesn't mention the preprocessing time, and for typical matches in a Perl program, this is significant, as most texts (strings) will be matched against only once.

        Well, I'm confused where you get the O(lg(x)) bound from. According to the paper, the running time is bounded by O(N(w(Mm)α + MmI(m + I))), and the space used is bounded by O(Mm + w log(Mm) + M(m + I)), where N is the number of words in the text to be searched, M is the number of patterns to search for, each pattern containing m words, w is the average length of a word, I is the number of insertions of foreign words, and α is some constant between 0 and 1, depending on tresholds. If we take w, m, I and α to be fixed, the bounds simplify to O(NM) time, and O(M) space. They make some arguments that in practise, the running time is bounded by O(NMα).

        But from what I can tell, this is just the running time for a match - it doesn't include the preprocessing time.

        The algorithm hinges on the following properties:

        1. The set of patterns doesn't change.
        2. The text to search in can be tokenized, and we are interested in finding (groups of) tokens.
        3. The patterns are simple strings.
        4. It'll do "fuzzy" matches.
        All of the items above would limit what now is possible with Perl regexes. Point 3 for instance means patterns can't use quantifiers, alteration or backreferences. The only interesting this about this algorithm is that it'll do "fuzzy" matches, but that's something you often do not want when matching. But if you want to do non-fuzzy matches of simple strings, just doing repeated index calls will give you an O(NM) time bound as well.

        Perhaps there's something in this paper that can be used for the Perl regex engine. But I didn't find anything when reading it. Perhaps you can elaborate?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://396624]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (1)
As of 2024-04-24 14:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found