http://qs1969.pair.com?node_id=483578

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I need to test chunks of text (1-100k) for matches against any one of several thousand short regex strings of about 5-20 characters. I've been joining the regex strings with | into a single string a few 100k long, for use in a pattern match like $text =~ /$the_big_string/; Although it works, I wonder about the efficiency, limits and alternatives of such an approach. Thank you for your time and expertise.

Replies are listed 'Best First'.
Re: pattern matching with large regex
by CountZero (Bishop) on Aug 13, 2005 at 17:48 UTC
    The only way to tell is to time your approach against any of the alternatives. The most obvious alternative would be --of course-- to run your regex-match in a loop, checking each alternative separately.

    My feeling however is that this will be much slower.

    Another alternative would be to hand-craft the regex by taking into account the same or similar chunks in each of the alternatives. The writing of the regex will take much more time and I'm not sure if such a more elaborate (but shorter) regex will work faster than a simple bunch of alternatives.
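    Hand-crafting mostly means hoisting shared prefixes out of the alternatives so the engine can anchor on a single literal before branching. A minimal sketch of the idea (the word list is made up for illustration):

```perl
use strict;
use warnings;

# Naive alternation vs. the same language with the shared prefix
# "net" factored out; both should accept and reject the same strings.
my @words    = qw(netcat netstat nethack);
my $naive    = qr/netcat|netstat|nethack/;
my $factored = qr/net(?:cat|stat|hack)/;   # shorter, prefix-anchored

my $agree = 1;
for my $w (@words, 'netwrong') {
    $agree = 0
        unless (($w =~ $naive) ? 1 : 0) == (($w =~ $factored) ? 1 : 0);
}
print $agree ? "factored and naive regexes agree\n" : "mismatch\n";
```

    Doing this by hand for thousands of alternatives is tedious, which is why the modules mentioned below automate it.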

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Looping will be slower? According to simonm, matching multiple times is faster than a single super-match. Only a benchmark of actual data will show for sure, but I've seen it said on PM many times already that multiple smaller matches are faster than a single larger match. I'm presuming that simpler FSMs combined with less opportunity to backtrack have something to do with it. I'm not entirely sure, though.

        Indeed only benchmarking will do, as a lot depends on the data-set (are the regexes anchored or not; is there room for a lot of back-tracking, ...).

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      Or instead of handcrafting it, use something like Regex::PreSuf, which groups substrings by longest prefixes (among other things).

      If you are using a static set of alternations, I would recommend caching the generated regex somehow, though, as building it is rather expensive time-wise. Additionally, it isn't guaranteed to be faster, so benchmark with reasonable data. (See Re^4: removing stop words for an example with a benchmark.)
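      One way to do that caching, sketched here with core modules only: since pattern-building modules like Regex::PreSuf return a pattern *string*, you can write that string to a file and recompile it with qr// on later runs. build_pattern() and the cache file name below are made up stand-ins; in real use build_pattern() would call the CPAN module.

```perl
use strict;
use warnings;

my $cache = "big_pattern.cache";   # made-up cache file name

# Stand-in for an expensive builder such as Regex::PreSuf's presuf();
# the returned pattern is hand-written here to match foo|food|foot.
sub build_pattern {
    return 'f(?:oo(?:d|t)?)';
}

my $pattern;
if (open my $fh, '<', $cache) {    # cache hit: reuse last run's string
    local $/;                      # slurp the whole file
    $pattern = <$fh>;
} else {                           # cache miss: build once, save
    $pattern = build_pattern();
    open my $out, '>', $cache or die "can't write $cache: $!";
    print {$out} $pattern;
}

my $re = qr/$pattern/;             # compile once per run
print "cached pattern matches\n" if "food" =~ $re;
```

      The pattern string round-trips through the file unchanged, so only the first run pays the construction cost.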

Re: pattern matching with large regex
by Tanktalus (Canon) on Aug 13, 2005 at 17:50 UTC

    A little more detail on what you're doing with the code would be helpful. For example, are you just testing for existence, or are you extracting pieces of data? Are these regular expressions using metacharacters such as .*[]?, or are they constant strings?

    Each of these answers may help us help you optimise your code appropriately. For example, constant strings are generally faster to find with index than with regular expressions. But if you have thousands, and you use a regular expression optimizer of some sort from CPAN, you may be able to get a reasonable state machine for finding your data.
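    A minimal sketch of the index approach for constant strings (the text and needle list are made up):

```perl
use strict;
use warnings;

my $text    = "the quick brown fox";
my @needles = ('brown', 'fox');    # constant strings, no metacharacters

# index() does a plain substring search and skips the regex engine
# entirely; it returns -1 when the needle is absent.
my @found;
for my $needle (@needles) {
    push @found, $needle if index($text, $needle) >= 0;
}
print "found: @found\n";
```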

    On the other hand, if you're trying to extract data, which I kind of doubt, and your regular expressions actually use regexp metacharacters, you're probably best off looping through the list:

    my @regexps = load_regexps();
    @regexps = map { qr/$_/ } @regexps;   # pre-compile 'em all.
    foreach my $re (@regexps) {
        if ($text =~ $re) {
            # do stuff based on match.
        }
    }
    Here we precompile each one, and then try them in turn. The compiled regular expressions should execute a bit faster - I'm not sure why, but I'm guessing because the state machine is way simpler. Note that if you only check a single chunk of text, you won't save anything by pre-compiling the regular expressions.

      Most of the regex strings are constant; a few hundred may contain simple constructs like alternation and character classes, e.g. (f?oo|bar|baz|etc)[\w\-]*\.[0-9]{3,} We only extract the data if it matches. As many have suggested, I benchmarked a typical case with the actual data, and unless something is wrong the difference is extreme:
      my %cases = (
          'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_string)/i){my $match="$1:$2"}},
          'many_small' => sub { for(@strings){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
      );
      print '$text       = ', length $text, " characters\n",
            '$big_string = ', length $big_string, " characters\n",
            '@strings    = ', scalar @strings, " items\n\n";
      cmpthese( 0, \%cases);
      Results:
      $text       = 4578 characters
      $big_string = 210724 characters
      @strings    = 10634 items

                   Rate many_small one_large
      many_small 1.05/s         --     -100%
      one_large   630/s     60089%        --

        Not having any of the data that you're working with, all I can do is offer suggestions that may or may not help - I can't actually test them out to see whether they work (and keep my mouth shut if they don't). ;-)

        So, I'm just curious what happens when you a) use a regexp optimiser from CPAN to "optimise" $big_string (of course, proving that the optimisation didn't break anything would be a bit painful), and b) pre-compile your @strings - e.g.:

        use Regexp::Optimizer;   # the optimiser module needs loading

        print '$text       = ', length $text, " characters\n",
              '$big_string = ', length $big_string, " characters\n",
              '@strings    = ', scalar @strings, " items\n\n";

        my $big_regexp    = Regexp::Optimizer->new()->optimize($big_string);
        my @small_regexps = map { qr/$_/i } @strings;

        my %cases = (
            'one_large'  => sub { if($text=~/(stuff?)m0r3(?:[^:]*\.)?($big_regexp)/i){my $match="$1:$2"}},
            'many_small' => sub { for(@small_regexps){ if($text=~/(stuff?)m0r3(?:[^:]*\.)?($_)/i){my $match="$1:$2"}}},
        );
        cmpthese( 0, \%cases);
        In your 'one_large' example you get the first match. In 'many_small' you get the last one; try adding a last when you get a match in the for loop and see what happens.
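        A self-contained sketch of that change, reusing the thread's regex shape (the sample text and pattern list below are made up):

```perl
use strict;
use warnings;

my $text          = "aaa stuffm0r3foo.bar";       # made-up sample text
my @small_regexps = map { qr/$_/i } qw(foo bar baz);

my $match;
for my $re (@small_regexps) {
    if ($text =~ /(stuff?)m0r3(?:[^:]*\.)?($re)/i) {
        $match = "$1:$2";
        last;    # stop at the first hit, like the single big alternation
    }
}
print "$match\n";
```

        With the last in place, the loop's timing and result both become comparable to the one_large case: it no longer keeps scanning after a hit.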
Re: pattern matching with large regex
by planetscape (Chancellor) on Aug 14, 2005 at 14:29 UTC