in reply to Efficient regex matching with qr//; Can I do better?

Per ikegami's suggestion that it is advantageous to do as much work outside the loop(s) as possible, this code may be helpful (although I'm not really sure I understand the problem completely):

use warnings; use strict; use Data::Dumper; MAIN: { # begin main loop # define hashes for text and patterns. my %text = ( text_id_1 => 'four score and foo bar baz seven years', text_id_2 => 'to fee fie foe or not', text_id_3 => 'to wibble or not to wobble', text_id_4 => 'text with no patterns present', text_id_5 => 'text with fee fie foe and wobble present', # ...etc. ); my %patterns = ( 'fee fie foe' => 'hi_lvl_id_1', 'foo bar baz' => 'hi_lvl_id_2', 'wibble' => 'hi_lvl_id_3', 'wobble' => 'hi_lvl_id_4', 'quux' => 'hi_lvl_id_5', # pattern present in no text # ...etc. ); # preprocess patterns hash for data extraction. while (my ($string, $id) = each %patterns) { $patterns{$string} = { hi_lvl_id => $id, regex => qr{ \b \Q$string\E \b }xms, parts => [ split /\s/, $string ], }; } # initialize output hashes. my %pattern_ids = map { $_ => '' } keys %text; my %part_ids = map { $_ => {} } keys %text; match(\%patterns, \%text, \%pattern_ids, \%part_ids); print "%pattern_ids: \n", Dumper \%pattern_ids; print "%part_ids: \n", Dumper \%part_ids; } # end main loop # subroutines ################################## sub match { my ($hr_patterns, # ref. to input hash: patterns $hr_text, # ref. to input hash: text strings $hr_pattern_ids, # ref. to output hash: identified pattern ids $hr_part_ids, # ref. to output hash: identified part ids ) = @_; PATTERN: for my $hr_pattern (values %$hr_patterns) { TEXT: while (my($text_id, $text) = each %$hr_text) { next TEXT unless $text =~ $hr_pattern->{regex}; $hr_pattern_ids->{$text_id} .= ":$hr_pattern->{hi_lvl_id}"; my @parts = @{ $hr_pattern->{parts} }; # for convenience @{ $hr_part_ids->{$text_id} }{@parts} = (0) x @parts; } # end while TEXT loop } # end for PATTERN loop }

Output:

%pattern_ids: $VAR1 = { 'text_id_3' => ':hi_lvl_id_3:hi_lvl_id_4', 'text_id_5' => ':hi_lvl_id_1:hi_lvl_id_4', 'text_id_2' => ':hi_lvl_id_1', 'text_id_4' => '', 'text_id_1' => ':hi_lvl_id_2' }; %part_ids: $VAR1 = { 'text_id_3' => { 'wibble' => 0, 'wobble' => 0 }, 'text_id_5' => { 'foe' => 0, 'wobble' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_2' => { 'foe' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_4' => {}, 'text_id_1' => { 'bar' => 0, 'baz' => 0, 'foo' => 0 } };

Perhaps more advantageous would be moritz's suggestion to compile all the patterns into one massive alternation, but I would quail at the thought of a regex that might easily exceed a million characters. But if it works...

Replies are listed 'Best First'.
Re^2: Efficient regex matching with qr//; Can I do better?
by moritz (Cardinal) on Jul 11, 2008 at 09:36 UTC
    Perhaps more advantageous would be moritz's suggestion to compile all the patterns into one massive alternation

    I didn't suggest that, because I think I once benchmarked it, and seem to recall that at some point it degrades performance (perhaps over the weekend I'll see if I can dig out my benchmark again). I suggested to chose a number, let's say 1000 for a start, and aggregate only that many alternatives into a single regex, and loop over these assembled regexes then.

      Thanks a lot all of you. I'm sorry I didn't specify the types of pattern clearly, but both the patterns and the texts consist of literals (a-z,0-9) and single spaces only. I have only had time to briefly look at your answers. The idea of doing as much as possible outside the loop is probably a good one which I'll try. The idea of using an array and specify indexes I didn't really understand the benefit of at first glance. Thirdly, the idea to just do one (or a few) large regex matches sounds very interesting. I use ActiveState Perl v. 5.8.8 for windows, so I realize it might be worth upgrading. I don't quite know what their latest version available is. Again, thanks a lot for your efforts!
        To illustrate the speed difference I wrote a small benchmark that assembles 500 random strings of length 5 to 15 into a regex, and matches that against a random string with 1 million characters. Here's the result:
        # perl 5.10.0: timethis 10: 0 wallclock secs ( 0.33 usr + 0.00 sys = 0.33 CPU) @ 3 +0.30/s (n=10) (warning: too few iterations for a reliable count) # perl 5.8.8: timethis 10: 79 wallclock secs (78.92 usr + 0.04 sys = 78.96 CPU) @ +0.13/s (n=10)

        This is on Linux, but I guess the results on Windows are similar. You see that in this case perl5.10.0 takes less than a second, while perl5.8.8 takes over a minute.

        I also tried it with a shorter target (100 chars) and more iterations, and the speed differences are similar.

        The idea of using an array and specify indexes I didn't really understand the benefit of at first glance.
        The idea is to eliminate as much code as possible specifically from the parts that run for each pattern and each text string. This means this bit:
        while (my($text_id,$text) = each(%hash2) ) { if ($text =~ $match)
        Which, for each iteration, does this:
        • each returns the key and value
        • those are copied in a list assigment
        • the match is attempted
        Both the copy and the overhead of the list assignment (which is not insignificant, when repeated a billion times) aren't necessary.

        If the text strings are unique, such that you can determine the $text_id given $text, you can leave out the array stuff and just do this:

        my @text = values %hash2; my %reverse_hash2 = reverse %hash2; while ( my($pattern,$high_lvl_id) = each(%hash1) ) { my $match = qr/\b$pattern\b/; for my $text (@text) { if ( $text =~ $match ) { my $text_id = $reverese_hash2{$text}; ... } } }
        Because for (@array) doesn't copy array but just iterates over it, and $text is aliased to each element in turn, not copied over it, this should be faster.
Re^2: Efficient regex matching with qr//; Can I do better?
by Anonymous Monk on Jul 12, 2008 at 00:21 UTC
    Note that the code given above has a (potential) bug: if a pattern is present two or more times in a text string, it will only be seen and processed once.

    Here is some code that does not have this bug (if, indeed, it is a bug) and that also should run faster. The (unbenchmarked) speed increase is expected because all text strings are concatenated into a single string and this string is searched for patterns by each regex, rather than having every individual text string searched by each regex.

    (This approach would probably mesh nicely with moritz's technique of concatenating large groups of patterns together as alternations.)

    =comment ASSUMPTIONS: - all text strings and patterns are composed of alphanumeric characters separated by spaces; - all pattern strings are unique (this is implied by the fact that they are used as keys in a hash). this approach will work if all the text strings can be concatenated together into a string that will fit into memory. (if not, relentless disk thrashing will ensue, along with an orders-of-magnitude speed penalty.) =cut use warnings; use strict; MAIN: { # begin main loop # define hashes for text and patterns. my %text = ( text_id_1 => 'four score and foo bar baz seven years', text_id_2 => 'to fee fie foe or not', text_id_3 => 'to wibble or not to wobble', text_id_4 => 'text with no patterns present', text_id_5 => 'text with fee fie foe and wobble present', text_id_6 => 'text with two wobble present wobble', # ...etc. ); # separator char not found in any text string or pattern string. my $sep = ':'; # anything not a separator is a text string or pattern string char. my $non_sep = qr{ [^$sep] }xms; # concatenation of all text strings. CAUTION: size! # also build hash associating text string endpoints with text ids. my $all_text = ''; my %text_endpoints; for my $text_id (keys %text) { $all_text .= $text{$text_id}; $text_endpoints{ length $all_text } = $text_id; $all_text .= $sep; } my %patterns = ( 'fee fie foe' => 'hi_lvl_id_1', 'foo bar baz' => 'hi_lvl_id_2', 'wibble' => 'hi_lvl_id_3', 'wobble' => 'hi_lvl_id_4', 'quux' => 'hi_lvl_id_5', # pattern present in no text # ...etc. ); # preprocess patterns hash for data extraction. while (my ($string, $id) = each %patterns) { $patterns{$string} = { hi_lvl_id => $id, parts => [ split /\s/, $string ], # capturing parentheses wrapped in non-capturing # look-ahead allows overlapping pattern capture. # this would be critical in the case of more than one # occurrence of a pattern string in a text string. regex => qr{ (?= ( \Q$string\E $non_sep* ) ) }xms, }; } # initialize output hashes. (this could be a single hash.) my %pattern_ids = map { $_ => '' } keys %text; my %part_ids = map { $_ => {} } keys %text; match(\$all_text, \%patterns, \%text_endpoints, \%pattern_ids, \%part_ids); use Data::Dumper; print "%pattern_ids: \n", Dumper \%pattern_ids; print "%part_ids: \n", Dumper \%part_ids; } # end main loop # subroutines ################################## sub match { my ($sr_all_text, # ref. to input scalar: all strings together $hr_patterns, # ref. to input hash: patterns $hr_text_endpoints, # ref. to input hash: text endpoints $hr_pattern_ids, # ref. to output hash: identified pattern ids $hr_part_ids, # ref. to output hash: identified part ids ) = @_; PATTERN: for my $hr_pattern (values %$hr_patterns) { TEXT: while (${$sr_all_text} =~ m{ $hr_pattern->{regex} }xmsg) { # get text id from text endpoint in @+ array. my $text_id = $hr_text_endpoints->{$+[1]}; $hr_pattern_ids->{$text_id} .= ":$hr_pattern->{hi_lvl_id}"; my @parts = @{ $hr_pattern->{parts} }; # for convenience @{ $hr_part_ids->{$text_id} }{@parts} = (0) x @parts; } # end while TEXT loop } # end for PATTERN loop }
    Output:
    %pattern_ids: $VAR1 = { 'text_id_3' => ':hi_lvl_id_3:hi_lvl_id_4', 'text_id_5' => ':hi_lvl_id_1:hi_lvl_id_4', 'text_id_2' => ':hi_lvl_id_1', 'text_id_4' => '', 'text_id_6' => ':hi_lvl_id_4:hi_lvl_id_4', 'text_id_1' => ':hi_lvl_id_2' }; %part_ids: $VAR1 = { 'text_id_3' => { 'wibble' => 0, 'wobble' => 0 }, 'text_id_5' => { 'foe' => 0, 'wobble' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_2' => { 'foe' => 0, 'fie' => 0, 'fee' => 0 }, 'text_id_4' => {}, 'text_id_6' => { 'wobble' => 0 }, 'text_id_1' => { 'bar' => 0, 'baz' => 0, 'foo' => 0 } };
      Thanks. Well, it isn't a bug. I am not interested in the multiplicity of the matching.