kruppy has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, I am trying to write a script that will identify the presence of certain terms in a large number of entered text messages. I am experiencing time efficieny problems and would like to know if I could have done anything better. The core of the code, which is the slow part, looks virtually like this: ...
sub slow_match { my($hash_ref_1,$hash_ref_2) = @_; while ( my($pattern,$high_lvl_id) = each(%hash1) ) { my $match = qr/\b$pattern\b/; while ( my($text_id,$text) = each(%hash2) ) { if ($text =~ $match) { $$hash_ref_1{$text_id} .= ':'.$high_lvl_id; foreach my $part (split(/\s/,$pattern)) { $$hash_ref_2{$text_id} -> {$part} = 0; } } } } }
... There are in the order of 20,000 texts and 50,000-100,000 patterns, and there is only on average about 1 pattern matching each text. The script runs in roughly 1.5 hours on my server, which I was hoping to improve. Using the qr// did improve it very much, but perhaps it could be even better. Many thanks, Ola

Replies are listed 'Best First'.
Re: Efficient regex matching with qr//; Can I do better?
by ysth (Canon) on Jul 11, 2008 at 06:02 UTC
    You are copying your texts a lot of times unnecessarily. You might try an array instead:
    sub slow_match { my($hash_ref_1,$hash_ref_2) = @_; my @text_id = keys %hash2; my @text = values %hash2; while ( my($pattern,$high_lvl_id) = each(%hash1) ) { my $match = qr/\b$pattern\b/; for my $text_index (0..$#text) { if ($text[$text_index] =~ $match) { $$hash_ref_1{$text_id[$index_index]} .= ':'.$high_lvl_id; foreach my $part (split(/\s/,$pattern)) { $$hash_ref_2{$text_id[$text_index]} -> {$part} = 0; } } } } }
Re: Efficient regex matching with qr//; Can I do better?
by moritz (Cardinal) on Jul 11, 2008 at 07:43 UTC
    ... There are in the order of 20,000 texts and 50,000-100,000 patterns,

    In general it's much faster to match one large regex than many regexes many times.

    That means you could try to assemble a regex of $x original regexes into one.

    Now it seems you have to know which regex matched, which means you have to distinguish them. In perl 5.10.0 or above you can use named captures. If you can't require such a new perl version, you can try something like this instead:

    our $which_matched; sub assemble_regex { my %regexes = @_; return join '|', map { q[(?:$regexes{$_})(?{\$which_matched='$_'})]} keys %regexes; }

    This assumes that keys in %regexes don't contain single quotes and trailing backslashes.

    If many of the patterns are constant strings, consider upgrading to perl 5.10.0 - it greatly speeds up matching of many constant alternatives.

      Ok, so I managed to get version 5.10 on to my local machine at least. But I am not really sure how one would construct such large, aggregated, regexes. I didn't get any further than this:
      if ($text =~ /(?<p1>\b$pattern1\b)|(?<p2>\b$pattern2\b)/) { foreach my $foo (keys %+) {print $foo.','.$+{$foo}."\n";} }
      but then of course I only get one of the matches (even if both match).

      Could you please give me a little example of such a construct? The documentation didn't help me much, I'm afraid.

      Many thanks, Ola
        If you want the alternations to match at the same starting position, you might be able to fiddle something together with look-ahead groups (not sure it works), but generally that doesn't work very well.

        You could try to match once, reset pos to the previous starting position, remove the regex that caused the match and retry again. But I don't think that's very efficient.

        If you don't want to match at the same position, you can use the /g modifier in a while loop to match multiple times.

Re: Efficient regex matching with qr//; Can I do better?
by ikegami (Patriarch) on Jul 11, 2008 at 06:00 UTC

    If it's worthwhile to take the regexp compiling outside of the inner while, it's probably worthwhile to take the split out too.

    If we knew what you were doing, including having a idea what the patterns are, we might be able to provide a better algorithm.

      Since the if block is only entered about 20000 times, changes there aren't likely to make a big difference.

      But yes, seeing what kind of patterns these are would provide some clues. If they are literal strings, for instance, it may be better to not do the qr// but instead do

      if (0 <= index($text, $pattern) && $text =~ /\b$pattern\b/) { }
      (the latter bit just to preserve the \b checks).
        From what I understand, a compiled regexp is just as fast as index for literal strings.
Re: Efficient regex matching with qr//; Can I do better?
by Anonymous Monk on Jul 11, 2008 at 09:30 UTC
    Per ikegami's suggestion that it is advantageous to do as much work outside the loop(s) as possible, this code may be helpful (although I'm not really sure I understand the problem completely):

    use warnings; use strict; use Data::Dumper; MAIN: { # begin main loop # define hashes for text and patterns. my %text = ( text_id_1 => 'four score and foo bar baz seven years', text_id_2 => 'to fee fie foe or not', text_id_3 => 'to wibble or not to wobble', text_id_4 => 'text with no patterns present', text_id_5 => 'text with fee fie foe and wobble present', # ...etc. ); my %patterns = ( 'fee fie foe' => 'hi_lvl_id_1', 'foo bar baz' => 'hi_lvl_id_2', 'wibble' => 'hi_lvl_id_3', 'wobble' => 'hi_lvl_id_4', 'quux' => 'hi_lvl_id_5', # pattern present in no text # ...etc. ); # preprocess patterns hash for data extraction. while (my ($string, $id) = each %patterns) { $patterns{$string} = { hi_lvl_id => $id, regex => qr{ \b \Q$string\E \b }xms, parts => [ split /\s/, $string ], }; } # initialize output hashes. my %pattern_ids = map { $_ => '' } keys %text; my %part_ids = map { $_ => {} } keys %text; match(\%patterns, \%text, \%pattern_ids, \%part_ids); print "%pattern_ids: \n", Dumper \%pattern_ids; print "%part_ids: \n", Dumper \%part_ids; } # end main loop # subroutines ################################## sub match { my ($hr_patterns, # ref. to input hash: patterns $hr_text, # ref. to input hash: text strings $hr_pattern_ids, # ref. to output hash: identified pattern ids $hr_part_ids, # ref. to output hash: identified part ids ) = @_; PATTERN: for my $hr_pattern (values %$hr_patterns) { TEXT: while (my($text_id, $text) = each %$hr_text) { next TEXT unless $text =~ $hr_pattern->{regex}; $hr_pattern_ids->{$text_id} .= ":$hr_pattern->{hi_lvl_id}"; my @parts = @{ $hr_pattern->{parts} }; # for convenience @{ $hr_part_ids->{$text_id} }{@parts} = (0) x @parts; } # end while TEXT loop } # end for PATTERN loop }

    Output:

    %pattern_ids: $VAR1 = { 'text_id_3' => ':hi_lvl_id_3:hi_lvl_id_4', 'text_id_5' => ':hi_lvl_id_1:hi_lvl_id_4', 'text_id_2' => ':hi_lvl_id_1', 'text_id_4' => '', 'text_id_1' => ':hi_lvl_id_2' }; %part_ids: $VAR1 = { 'text_id_3' => { 'wibble' => 0, 'wobble' => 0 }, 'text_id_5' => { 'foe' => 0, 'wobble' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_2' => { 'foe' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_4' => {}, 'text_id_1' => { 'bar' => 0, 'baz' => 0, 'foo' => 0 } };

    Perhaps more advantageous would be moritz's suggestion to compile all the patterns into one massive alternation, but I would quail at the thought of a regex that might easily exceed a million characters. But if it works...

      Perhaps more advantageous would be moritz's suggestion to compile all the patterns into one massive alternation

      I didn't suggest that, because I think I once benchmarked it, and seem to recall that at some point it degrades performance (perhaps over the weekend I'll see if I can dig out my benchmark again). I suggested to chose a number, let's say 1000 for a start, and aggregate only that many alternatives into a single regex, and loop over these assembled regexes then.

        Thanks a lot all of you. I'm sorry I didn't specify the types of pattern clearly, but both the patterns and the texts consist of literals (a-z,0-9) and single spaces only. I have only had time to briefly look at your answers. The idea of doing as much as possible outside the loop is probably a good one which I'll try. The idea of using an array and specify indexes I didn't really understand the benefit of at first glance. Thirdly, the idea to just do one (or a few) large regex matches sounds very interesting. I use ActiveState Perl v. 5.8.8 for windows, so I realize it might be worth upgrading. I don't quite know what their latest version available is. Again, thanks a lot for your efforts!
      Note that the code given above has a (potential) bug: if a pattern is present two or more times in a text string, it will only be seen and processed once.

      Here is some code that does not have this bug (if, indeed, it is a bug) and that also should run faster. The (unbenchmarked) speed increase is expected because all text strings are concatenated into a single string and this string is searched for patterns by each regex, rather than having every individual text string searched by each regex.

      (This approach would probably mesh nicely with moritz's technique of concatenating large groups of patterns together as alternations.)

      =comment ASSUMPTIONS: - all text strings and patterns are composed of alphanumeric characters separated by spaces; - all pattern strings are unique (this is implied by the fact that they are used as keys in a hash). this approach will work if all the text strings can be concatenated together into a string that will fit into memory. (if not, relentless disk thrashing will ensue, along with an orders-of-magnitude speed penalty.) =cut use warnings; use strict; MAIN: { # begin main loop # define hashes for text and patterns. my %text = ( text_id_1 => 'four score and foo bar baz seven years', text_id_2 => 'to fee fie foe or not', text_id_3 => 'to wibble or not to wobble', text_id_4 => 'text with no patterns present', text_id_5 => 'text with fee fie foe and wobble present', text_id_6 => 'text with two wobble present wobble', # ...etc. ); # separator char not found in any text string or pattern string. my $sep = ':'; # anything not a separator is a text string or pattern string char. my $non_sep = qr{ [^$sep] }xms; # concatenation of all text strings. CAUTION: size! # also build hash associating text string endpoints with text ids. my $all_text = ''; my %text_endpoints; for my $text_id (keys %text) { $all_text .= $text{$text_id}; $text_endpoints{ length $all_text } = $text_id; $all_text .= $sep; } my %patterns = ( 'fee fie foe' => 'hi_lvl_id_1', 'foo bar baz' => 'hi_lvl_id_2', 'wibble' => 'hi_lvl_id_3', 'wobble' => 'hi_lvl_id_4', 'quux' => 'hi_lvl_id_5', # pattern present in no text # ...etc. ); # preprocess patterns hash for data extraction. while (my ($string, $id) = each %patterns) { $patterns{$string} = { hi_lvl_id => $id, parts => [ split /\s/, $string ], # capturing parentheses wrapped in non-capturing # look-ahead allows overlapping pattern capture. # this would be critical in the case of more than one # occurrence of a pattern string in a text string. regex => qr{ (?= ( \Q$string\E $non_sep* ) ) }xms, }; } # initialize output hashes. (this could be a single hash.) my %pattern_ids = map { $_ => '' } keys %text; my %part_ids = map { $_ => {} } keys %text; match(\$all_text, \%patterns, \%text_endpoints, \%pattern_ids, \%part_ids); use Data::Dumper; print "%pattern_ids: \n", Dumper \%pattern_ids; print "%part_ids: \n", Dumper \%part_ids; } # end main loop # subroutines ################################## sub match { my ($sr_all_text, # ref. to input scalar: all strings together $hr_patterns, # ref. to input hash: patterns $hr_text_endpoints, # ref. to input hash: text endpoints $hr_pattern_ids, # ref. to output hash: identified pattern ids $hr_part_ids, # ref. to output hash: identified part ids ) = @_; PATTERN: for my $hr_pattern (values %$hr_patterns) { TEXT: while (${$sr_all_text} =~ m{ $hr_pattern->{regex} }xmsg) { # get text id from text endpoint in @+ array. my $text_id = $hr_text_endpoints->{$+[1]}; $hr_pattern_ids->{$text_id} .= ":$hr_pattern->{hi_lvl_id}"; my @parts = @{ $hr_pattern->{parts} }; # for convenience @{ $hr_part_ids->{$text_id} }{@parts} = (0) x @parts; } # end while TEXT loop } # end for PATTERN loop }
      Output:
      %pattern_ids: $VAR1 = { 'text_id_3' => ':hi_lvl_id_3:hi_lvl_id_4', 'text_id_5' => ':hi_lvl_id_1:hi_lvl_id_4', 'text_id_2' => ':hi_lvl_id_1', 'text_id_4' => '', 'text_id_6' => ':hi_lvl_id_4:hi_lvl_id_4', 'text_id_1' => ':hi_lvl_id_2' }; %part_ids: $VAR1 = { 'text_id_3' => { 'wibble' => 0, 'wobble' => 0 }, 'text_id_5' => { 'foe' => 0, 'wobble' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_2' => { 'foe' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_4' => {}, 'text_id_6' => { 'wobble' => 0 }, 'text_id_1' => { 'bar' => 0, 'baz' => 0, 'foo' => 0 } };
        Thanks. Well, it isn't a bug. I am not interested in the multiplicity of the matching.
Re: Efficient regex matching with qr//; Can I do better?
by waba (Monk) on Jul 11, 2008 at 18:13 UTC

    You could try out the study function. However, you'll have to refactor your code so that the outer loop is done on the texts, and of course pull the regex compiling out of there.

    Assuming that %hash1 and 2 are different from $hash1_ref, this could become something like (untested):

    sub slow_match { my ( $hash_ref_1, $hash_ref_2 ) = @_; # %texts, %patterns coming from global (was: %hash1, %hash2) my %matches; foreach my $pattern ( keys %$patterns_ref ) { $matches{$pattern} = qr/\b$pattern\b/; } while ( my ( $text_id, $text ) = each %texts ) { study $text; while ( my ( $pattern, $high_lvl_id ) = each %patterns ) { if ( $text =~ $matches{$pattern} ) { $$hash_ref_1{$text_id} .= ':'.$high_lvl_id; foreach my $part (split(/\s/,$pattern)) { $$hash_ref_2{$text_id} -> {$part} = 0; } } } } }

    Update: added code
      Again, thanks all for your suggestions. To avoid extra work I think I'll first try to upgrade to version 10.0. The only problem is that my system administrator is away for four weeks, so I'll have to wait for that... Frustrating indeed.
        I thought I had already written this, weird...

        Working along moritz' suggestions of using larger regexes, in combination with some additional pre-processing and an upgrade to v.10.0, I am now down to 2.5 minutes on my local machine. Thus I am happy, and you do not need to post any more in this thread for the sake of helping me out.

        Thanks all of you who wrote here.