Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re^4: The Dream of a New, Faster Regular Expression Engine

by monsieur_champs (Curate)
on Oct 06, 2004 at 13:38 UTC ( [id://396982]=note: print w/replies, xml ) Need Help??


in reply to Re^3: The Dream of a New, Faster Regular Expression Engine
in thread The Dream of a New, Faster Regular Expression Engine

Ok, here is the paper.

  • Comment on Re^4: The Dream of a New, Faster Regular Expression Engine

Replies are listed 'Best First'.
Re^5: The Dream of a New, Faster Regular Expression Engine
by Anonymous Monk on Oct 07, 2004 at 09:30 UTC
    Well, I'm confused where you get the O(lg(x)) bound from. According to the paper, the running time is bounded by O(N(w(Mm)α + MmI(m + I))), and the space used is bounded by O(Mm + w log(Mm) + M(m + I)), where N is the number of words in the text to be searched, M is the number of patterns to search for, each pattern containing m words, w is the average length of a word, I is the number of insertions of foreign words, and α is some constant between 0 and 1, depending on tresholds. If we take w, m, I and α to be fixed, the bounds simplify to O(NM) time, and O(M) space. They make some arguments that in practise, the running time is bounded by O(NMα).

    But from what I can tell, this is just the running time for a match - it doesn't include the preprocessing time.

    The algorithm hinges on the following properties:

    1. The set of patterns doesn't change.
    2. The text to search in can be tokenized, and we are interested in finding (groups of) tokens.
    3. The patterns are simple strings.
    4. It'll do "fuzzy" matches.
    All of the items above would limit what now is possible with Perl regexes. Point 3 for instance means patterns can't use quantifiers, alteration or backreferences. The only interesting this about this algorithm is that it'll do "fuzzy" matches, but that's something you often do not want when matching. But if you want to do non-fuzzy matches of simple strings, just doing repeated index calls will give you an O(NM) time bound as well.

    Perhaps there's something in this paper that can be used for the Perl regex engine. But I didn't find anything when reading it. Perhaps you can elaborate?

      The "O(log(x))" thing was told to me by Joćo Marcelo, one of the co-authors of the paper. I personally have not calculated this yet.

      And yes, I will to elaborate a solution for the regular expression engine, as soon as I start my PhD tesis. If everything goes well, this would happen before Perl6's first release :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://396982]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (5)
As of 2024-04-19 03:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found