http://qs1969.pair.com?node_id=483578

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to test chunks of text (1-100k) for matches against any one of several thousand short regex strings of about 5-20 characters. I've been joining the regex strings with | into a single string a few 100k long, for use in a pattern match like $text =~ /$the_big_string/; Although it works, I wonder about the efficiency, limits and alternatives of such an approach. Thank you for your time and expertise.

Replies are listed 'Best First'.
Re: pattern matching with large regex
by CountZero (Bishop) on Aug 13, 2005 at 17:48 UTC
    The only way to tell is to time your approach against any of the alternatives. The most obvious alternative would be --of course-- to run your regex-match in a loop, checking each alternative separately.

    My feeling however is that this will be much slower.

    Another alternative would be to hand-craft the regex by taking into account the same or similar chunks in each of the alternatives. The writing of the regex will take much more time and I'm not sure if such a more elaborate (but shorter) regex will work faster than a simple bunch of alternatives.
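    Hand-crafting mostly means hoisting shared prefixes out of the alternatives so the engine can anchor on a single literal before branching. A minimal sketch of the idea (the word list is made up for illustration):

```perl
use strict;
use warnings;

# Naive alternation vs. the same language with the shared prefix
# "net" factored out; both should accept and reject the same strings.
my @words    = qw(netcat netstat nethack);
my $naive    = qr/netcat|netstat|nethack/;
my $factored = qr/net(?:cat|stat|hack)/;   # shorter, prefix-anchored

my $agree = 1;
for my $w (@words, 'netwrong') {
    $agree = 0
        unless (($w =~ $naive) ? 1 : 0) == (($w =~ $factored) ? 1 : 0);
}
print $agree ? "factored and naive regexes agree\n" : "mismatch\n";
```

    Doing this by hand for thousands of alternatives is tedious, which is why the modules mentioned below automate it.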

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Looping will be slower? According to simonm, matching multiple times is faster than a single super-match. Only a benchmark of actual data will show for sure, but I've seen it said on PM many times already that multiple smaller matches are faster than a single larger match. I'm presuming that simpler FSMs combined with less opportunity to backtrack have something to do with it. I'm not entirely sure, though.

        Indeed only benchmarking will do, as a lot depends on the data-set (are the regexes anchored or not; is there room for a lot of back-tracking, ...).

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Or instead of handcrafting it, use something like Regex::PreSuf, which groups substrings by longest prefixes (among other things).

      If you are using a static set of alternations, I would recommend caching the generated regex somehow, though, as building it is rather expensive time-wise. Additionally, it isn't guaranteed to be faster, so benchmark with reasonable data. (See Re^4: removing stop words for an example with a benchmark.)
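      One way to do that caching, sketched here with core modules only: since pattern-building modules like Regex::PreSuf return a pattern *string*, you can write that string to a file and recompile it with qr// on later runs. build_pattern() and the cache file name below are made up stand-ins; in real use build_pattern() would call the CPAN module.

```perl
use strict;
use warnings;

my $cache = "big_pattern.cache";   # made-up cache file name

# Stand-in for an expensive builder such as Regex::PreSuf's presuf();
# the returned pattern is hand-written here to match foo|food|foot.
sub build_pattern {
    return 'f(?:oo(?:d|t)?)';
}

my $pattern;
if (open my $fh, '<', $cache) {    # cache hit: reuse last run's string
    local $/;                      # slurp the whole file
    $pattern = <$fh>;
} else {                           # cache miss: build once, save
    $pattern = build_pattern();
    open my $out, '>', $cache or die "can't write $cache: $!";
    print {$out} $pattern;
}

my $re = qr/$pattern/;             # compile once per run
print "cached pattern matches\n" if "food" =~ $re;
```

      The pattern string round-trips through the file unchanged, so only the first run pays the construction cost.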

Re: pattern matching with large regex
by Tanktalus (Canon) on Aug 13, 2005 at 17:50 UTC

    A little more detail on what you're doing with the code would be helpful. For example, are you just testing for existence, or are you extracting pieces of data? Are these regular expressions using metacharacters such as .*[]?, or are they constant strings?

    Each of these answers may help us help you optimise your code appropriately. For example, constant strings are generally faster to find with index than with regular expressions. But if you have thousands, and you use a regular expression optimizer of some sort from CPAN, you may be able to get a reasonable state machine for finding your data.
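    A minimal sketch of the index approach for constant strings (the text and needle list are made up):

```perl
use strict;
use warnings;

my $text    = "the quick brown fox";
my @needles = ('brown', 'fox');    # constant strings, no metacharacters

# index() does a plain substring search and skips the regex engine
# entirely; it returns -1 when the needle is absent.
my @found;
for my $needle (@needles) {
    push @found, $needle if index($text, $needle) >= 0;
}
print "found: @found\n";
```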

    On the other hand, if you're trying to extract data, which I kind of doubt, and your regular expressions actually use regexp metacharacters, you're probably best off looping through the list:

    my @regexps = load_regexps();
    @regexps = map { qr/$_/ } @regexps;   # pre-compile 'em all.
    foreach my $re (@regexps) {
        if ($text =~ $re) {
            # do stuff based on match.
        }
    }
    Here we precompile each one, and then try them in turn. The compiled regular expressions should execute a bit faster - I'm not sure why, but I'm guessing because the state machine is way simpler. Note that if you only check a single chunk of text, you won't save anything by pre-compiling the regular expressions.

      Most of the regex strings are constant; a few hundred may contain simple constructs like alternation and character classes, e.g. (f?oo|bar|baz|etc)[\w\-]*\.[0-9]{3,} We only extract the data if it matches. As many have suggested, I benchmarked a typical case with the actual data, and unless something is wrong the difference is extreme:
      my %cases = (
          'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_string)/i){my $match="$1:$2"}},
          'many_small' => sub { for(@strings){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
      );
      print '$text       = ', length $text, " characters\n",
            '$big_string = ', length $big_string, " characters\n",
            '@strings    = ', scalar @strings, " items\n\n";
      cmpthese( 0, \%cases);
      Results:
      $text       = 4578 characters
      $big_string = 210724 characters
      @strings    = 10634 items

                   Rate many_small one_large
      many_small 1.05/s         --     -100%
      one_large   630/s     60089%        --

        Not having any of the data that you're working with, all I can do is offer suggestions that may or may not help - I can't actually test them out to see whether they work (and keep my mouth shut if they don't). ;-)

        So, I'm just curious what happens when you a) use a regexp optimiser from CPAN to "optimise" $big_string (of course, proving that the optimisation didn't break anything would be a bit painful), and b) pre-compile your @strings - e.g.:

        use Regexp::Optimizer;   # the optimiser module needs loading

        print '$text       = ', length $text, " characters\n",
              '$big_string = ', length $big_string, " characters\n",
              '@strings    = ', scalar @strings, " items\n\n";

        my $big_regexp    = Regexp::Optimizer->new()->optimize($big_string);
        my @small_regexps = map { qr/$_/i } @strings;

        my %cases = (
            'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_regexp)/i){my $match="$1:$2"}},
            'many_small' => sub { for(@small_regexps){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
        );
        cmpthese( 0, \%cases);
        In your 'one_large' example you get the first match. In 'many_small' you get the last one; try adding a last when you get a match in the for loop and see what happens.
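        A self-contained sketch of that change, reusing the thread's regex shape (the sample text and pattern list below are made up):

```perl
use strict;
use warnings;

my $text          = "aaa stuffm0r3foo.bar";       # made-up sample text
my @small_regexps = map { qr/$_/i } qw(foo bar baz);

my $match;
for my $re (@small_regexps) {
    if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($re)/i) {
        $match = "$1:$2";
        last;    # stop at the first hit, like the single big alternation
    }
}
print "$match\n";
```

        With the last in place, the loop's timing and result both become comparable to the one_large case: it no longer keeps scanning after a hit.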
Re: pattern matching with large regex
by planetscape (Chancellor) on Aug 14, 2005 at 14:29 UTC