in reply to Re: regexp performance on large logfiles
in thread regexp performance on large logfiles

I'm matching surf data (i.e. URLs) against a list of almost a thousand patterns in the first iteration.

The data looks something like:

1217930320 jydawg http://www.perlmonks.org/index.pl

and a relevant pattern might look something like:

*.perlmonks.org

Given that many patterns, I wonder whether I would benefit from Regexp::Assemble or Regexp::Trie.
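
Something like this is what I have in mind, as a minimal sketch; the glob-to-regex translation and the tiny pattern list are just my guesses at the setup:

    use strict;
    use warnings;
    use Regexp::Assemble;

    # Hypothetical pattern list; the real one has almost a thousand entries.
    my @globs = ( '*.perlmonks.org', '*.example.com' );

    my $ra = Regexp::Assemble->new;
    for my $glob (@globs) {
        my $pat = quotemeta $glob;     # escape the literal dots
        $pat =~ s/\\\*/[^\/]*/g;       # turn the escaped * back into a wildcard
        $ra->add($pat);
    }
    my $big_re = $ra->re;              # one combined, compiled regex

    while (my $line = <STDIN>) {
        my (undef, undef, $url) = split ' ', $line;
        print $line if $url =~ $big_re;
    }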

I'm looking into the other two pieces of advice. I don't suppose sticking a gigabyte of text data in an array is a good idea :P. Well, maybe in chunks ...
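
By chunks I mean something like reading line by line, which keeps memory flat no matter how big the file gets (a sketch with a made-up filename):

    use strict;
    use warnings;

    # Hypothetical filename; <$fh> reads one line at a time, so memory
    # use stays constant regardless of the file's size.
    open my $fh, '<', 'surf.log' or die "Can't open surf.log: $!";
    while (my $line = <$fh>) {
        # ... match $line against the patterns here ...
    }
    close $fh;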

Re^3: regexp performance on large logfiles
by Corion (Patriarch) on Aug 05, 2008 at 10:44 UTC

    At least for such simplistic patterns, you'll likely do much better if you just do a substring search. The regex engine will do the substring search for you as well, but if you don't have www.*.info as a valid pattern, using index could be faster. Even if you do have such patterns, you can move them towards the end of the checks, so that the cheap checks run first and most lines never reach the expensive patterns.

    Update: I meant index...
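
    A minimal sketch of that idea (the split of the pattern list is hypothetical): check the literal fragments with index first, and only fall back to the regex engine for the few genuine wildcard patterns.

        use strict;
        use warnings;

        # Hypothetical split of the pattern list: cheap literal
        # substrings first, real wildcard patterns (e.g. www.*.info) last.
        my @literals  = ( '.perlmonks.org', '.example.com' );
        my @wildcards = ( qr/www\.[^.\/]+\.info/ );

        sub url_matches {
            my ($url) = @_;
            for my $lit (@literals) {
                # index() is a plain substring search; no regex engine involved.
                return 1 if index($url, $lit) >= 0;
            }
            for my $re (@wildcards) {
                return 1 if $url =~ $re;
            }
            return 0;
        }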

      Current runtime: 25.4 minutes, just from re-examining the patterns and using index and substr as much as possible.

      Do you mean index? That shouldn't be any faster than a compiled regexp. The problem the OP has is that he's compiling his regexps over and over again.
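
      A sketch of the difference, assuming the OP's inner loop interpolates pattern strings: a /$pat/ whose interpolated string changes between matches is recompiled every time, while qr// compiles each pattern exactly once.

          use strict;
          use warnings;

          my @pattern_strings = ( '\.perlmonks\.org', '\.example\.com' );

          # Slow: /$pat/ is recompiled whenever the interpolated string
          # changes, i.e. on every pass through the inner loop.
          #   for my $pat (@pattern_strings) { ... if $line =~ /$pat/ }

          # Fast: compile each pattern once, up front, with qr//.
          my @compiled = map { qr/$_/ } @pattern_strings;

          while (my $line = <STDIN>) {
              for my $re (@compiled) {
                  if ($line =~ $re) { print $line; last }
              }
          }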