in reply to Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first)

Use index instead of a regex. Here's a benchmark (print is commented out to remove the effects from the benchmark.)

use Benchmark qw(cmpthese); my @strings = ( 'Some string about this long or so, maybe this long', 'I like pizza this long or so, maybe this long', 'this long or so, maybe this French Fries long', 'This Sugar Rush Rocks. maybe this do not stop the clock.', ); my %hash = ( 'Sugar Rush Rocks' => 'whatever', 'long' => 'itsgood', 'this long' => 'ilikeit', 'maybe this' => 'itsokay', 'Some String' => 'loooveit' ); my @keys_sorted_by_length_desc = sort { length $b <=> length $a } keys %hash; cmpthese( -5, { 'Regex' => \&use_regex, 'Index' => \&use_index, } ); sub use_regex { foreach my $string (@strings) { foreach my $key (@keys_sorted_by_length_desc) { my $key_re = quotemeta($key); if ( $string =~ /(:?\A|\s)($key_re)\s*/i ) { #print "Found '$key' in '$string'\n"; last; } } } } sub use_index { foreach my $string (@strings) { foreach my $key (@keys_sorted_by_length_desc) { my $lcstring = lc $string; my $lckey = lc $key; if ( ( index $lcstring, $lckey ) > -1 ) { #print "Found '$key' in '$string'\n"; last; } } } } __END__ Rate Regex Index Regex 7539/s -- -86% Index 52227/s 593% --

Update: move generating the sorted list out of the subroutines. Even better.

          Rate Regex Index
Regex   8311/s    --  -92%
Index 101399/s 1120%    --
  • Comment on Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first)
  • Download Code

Replies are listed 'Best First'.
Re^2: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first)
by ikegami (Patriarch) on May 14, 2010 at 18:52 UTC

    You're not comparing regex vs index, you're comparing a buggy and inefficient implementation of regex with a non-equivalent implementation using index.

    • "(:?" should be "(?:"
    • You compile the same regex repeatedly.
    • You use captures where none are needed.
    • The \s* is completely useless.
    • You check if they key is the start of a word string with the regex solution, but not with the index solution.

    Fixed:

    use strict; use warnings; use Benchmark qw(cmpthese); my @strings = ( 'Some string about this long or so, maybe this long', 'I like pizza this long or so, maybe this long', 'this long or so, maybe this French Fries long', 'This Sugar Rush Rocks. maybe this do not stop the clock.', ); my %hash = ( 'Sugar Rush Rocks' => 'whatever', 'long' => 'itsgood', 'this long' => 'ilikeit', 'maybe this' => 'itsokay', 'Some String' => 'loooveit' ); my @keys_sorted_by_length_desc = sort { length $b <=> length $a } keys %hash; cmpthese(-2, { Regex => \&use_regex, Index => \&use_index, Index2 => \&use_index2, }); sub use_regex { my %re_keys = map { $_ => qr/\b\L\Q$_\E\b/ } @keys_sorted_by_lengt +h_desc; for my $string (@strings) { my $s = lc($string); for my $key (@keys_sorted_by_length_desc) { if ( $s =~ $re_keys{$key} ) { #print "Found '$key' in '$string'\n"; last; } } } } sub use_index { my @lc_keys = map lc, @keys_sorted_by_length_desc; for my $string (@strings) { my $s = lc($string); for my $key (@keys_sorted_by_length_desc) { my $lc_key = lc($key); my $spos = index($s, $lc_key); if ($spos >= 0 && ( $spos == 0 || substr($s, $spos-1, 1) = +~ /\W/ )) { my $epos = $spos + length($lc_key); if ($epos == length($s) || substr($s, $epos, 1) =~ /\W +/) { #print "Found '$key' in '$string'\n"; last; } } } } } sub use_index2 { for my $string (@strings) { ( my $s = " \L$string " ) =~ s/\W/ /g; for my $key (@keys_sorted_by_length_desc) { my $lc_key = " \L$key "; if (index($s, $lc_key) >= 0) { #print "Found '$key' in '$string'\n"; last; } } } }

    Results:

    Rate Regex Index2 Index Regex 23206/s -- -33% -56% Index2 34824/s 50% -- -35% Index 53339/s 130% 53% --

    Update: Fixed many many problems.

      I don't disagree with your commentary about the crappy regex. It isn't mine though, it is what the OP had.

      And, while it's true that my solution has some serious shortcomings, your solution has its own set of problems. The OP said that he needs the $key value later to get values out of the hash. If it is a precompiled regex, he can't use it. You'll probably need to compile a hash of key, regex pairs, but that will raise issues of ordering again. Also, $s contains the lc string, I think you meant $_ in the print line. That's just a typo though.

        Fixed. Tested. Replaced code in original node.