in reply to Re: regexp performance on large logfiles
in thread regexp performance on large logfiles

I'm matching surf data (i.e. URLs) against a list of almost a thousand patterns in the first iteration.

The data looks something like:

1217930320 jydawg http://www.perlmonks.org/index.pl

and a relevant pattern might look something like:

*.perlmonks.org

Given that many patterns, I wonder whether I would benefit from Regexp::Assemble or Regexp::Trie.
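
Something like this is what I have in mind, as a minimal sketch; the glob-to-regex translation and the tiny pattern list are just my guesses at the setup:

    use strict;
    use warnings;
    use Regexp::Assemble;

    # Hypothetical pattern list; the real one has almost a thousand entries.
    my @globs = ( '*.perlmonks.org', '*.example.com' );

    my $ra = Regexp::Assemble->new;
    for my $glob (@globs) {
        my $pat = quotemeta $glob;     # escape the literal dots
        $pat =~ s/\\\*/[^\/]*/g;       # turn the escaped * back into a wildcard
        $ra->add($pat);
    }
    my $big_re = $ra->re;              # one combined, compiled regex

    while (my $line = <STDIN>) {
        my (undef, undef, $url) = split ' ', $line;
        print $line if $url =~ $big_re;
    }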

I'm looking into the other two pieces of advice. I don't suppose sticking a gigabyte of text data in an array is a good idea :P. Well, maybe in chunks ...
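
By chunks I mean something like reading line by line, which keeps memory flat no matter how big the file gets (a sketch with a made-up filename):

    use strict;
    use warnings;

    # Hypothetical filename; <$fh> reads one line at a time, so memory
    # use stays constant regardless of the file's size.
    open my $fh, '<', 'surf.log' or die "Can't open surf.log: $!";
    while (my $line = <$fh>) {
        # ... match $line against the patterns here ...
    }
    close $fh;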

Re^3: regexp performance on large logfiles
by Corion (Patriarch) on Aug 05, 2008 at 10:44 UTC

    At least for such simplistic patterns, you'll likely do much better if you just do a substring search. The regex engine will do the substring search for you as well, but if you don't have www.*.info as a valid pattern, using index could be faster. Even if you do have such patterns, you can move them towards the end of the checks, so that the cheap checks run first and most lines never reach the expensive patterns.

    Update: I meant index...
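
    A minimal sketch of that idea (the split of the pattern list is hypothetical): check the literal fragments with index first, and only fall back to the regex engine for the few genuine wildcard patterns.

        use strict;
        use warnings;

        # Hypothetical split of the pattern list: cheap literal
        # substrings first, real wildcard patterns (e.g. www.*.info) last.
        my @literals  = ( '.perlmonks.org', '.example.com' );
        my @wildcards = ( qr/www\.[^.\/]+\.info/ );

        sub url_matches {
            my ($url) = @_;
            for my $lit (@literals) {
                # index() is a plain substring search; no regex engine involved.
                return 1 if index($url, $lit) >= 0;
            }
            for my $re (@wildcards) {
                return 1 if $url =~ $re;
            }
            return 0;
        }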

      Current runtime: 25.4 minutes, just from re-examining the patterns and using index and substr as much as possible.

      Do you mean index? That shouldn't be any faster than a compiled regexp. The problem the OP has is that he's compiling his regexps over and over again.
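
      A sketch of the difference, assuming the OP's inner loop interpolates pattern strings: a /$pat/ whose interpolated string changes between matches is recompiled every time, while qr// compiles each pattern exactly once.

          use strict;
          use warnings;

          my @pattern_strings = ( '\.perlmonks\.org', '\.example\.com' );

          # Slow: /$pat/ is recompiled whenever the interpolated string
          # changes, i.e. on every pass through the inner loop.
          #   for my $pat (@pattern_strings) { ... if $line =~ /$pat/ }

          # Fast: compile each pattern once, up front, with qr//.
          my @compiled = map { qr/$_/ } @pattern_strings;

          while (my $line = <STDIN>) {
              for my $re (@compiled) {
                  if ($line =~ $re) { print $line; last }
              }
          }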