dgaramond2 has asked for the wisdom of the Perl Monks concerning the following question:

Suppose I have a lot of short strings (s1, s2, ... sn). I want to check, as fast as possible, whether any of them is in a long string (S). The short strings are not necessarily words, so I cannot use full-text indexing. Is there a smarter, more efficient way then doing:

S =~ /(s1|s2|...|sn)/

Btw, it'd be nice if I can do case-insensitive match too.

The application is for blocking emails which might contain a list of keywords/strings which the user specifies in a blacklist.

Replies are listed 'Best First'.
Re: Fast searching of multiple substrings in a string
by Corion (Patriarch) on Oct 07, 2006 at 14:52 UTC

    Aside from building one combined regex by using Regexp::Assemble, it comes down to a tradeoff between doing multiple substr() queries against the string and running one regex against the string. If you care, the development version of Perl, "bleadperl", has an improvements regex engine by demerphq which make such searching faster.

    The only optimization besides using a different version of Perl that could make your regex faster would be to collect all the first letters of all your words and make them into a zero-width lookahead character class I think, but I would make sure by benchmarking that this really brings a speedup.

      I was inspired to benchmark your advice. In my testing environment, I was astonished to find out that the speed increases as you collect more and more letters (first letters, then second letters, etc.) in your look-ahead.
      Benchmark: timing 200 iterations of lookahead1, lookahead3, lookahead4 +, lookahead5, nolookahead... lookahead1: 3 wallclock secs ( 2.93 usr + 0.00 sys = 2.93 CPU) @ 68 +.26/s (n=200) lookahead3: 2 wallclock secs ( 2.23 usr + 0.00 sys = 2.23 CPU) @ 89 +.69/s (n=200) lookahead4: 2 wallclock secs ( 2.03 usr + 0.00 sys = 2.03 CPU) @ 98 +.52/s (n=200) lookahead5: 2 wallclock secs ( 2.00 usr + 0.00 sys = 2.00 CPU) @ 10 +0.00/s (n=200) nolookahead: 7 wallclock secs ( 6.30 usr + 0.00 sys = 6.30 CPU) @ 3 +1.75/s (n=200)
Re: Fast searching of multiple substrings in a string
by jwkrahn (Abbot) on Oct 07, 2006 at 16:50 UTC
Re: Fast searching of multiple substrings in a string
by betterworld (Curate) on Oct 07, 2006 at 14:44 UTC
    Regexp::Optimizer could be of help. It turns /bar|baz/ into /ba[rz]/, which is supposed to run faster.