Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by moritz (Cardinal) on May 14, 2010 at 16:43 UTC
Obviously this approach scales horribly. How can I make it fast? First approach: use perl 5.10 or newer, it has a very good optimization for many alternations of literal strings. You might want to increase ${^RE_TRIE_MAXBUF} to make them all your keys fit into the same trie. There's also Regexp::Assemble, which is said to optimize the matching quite a bit, but I haven't tried it myself yet. Another obvious improvement is to move the regex generation and compilation out of the loop. Update: I've mis-read the original code. The key to speed is to assemble all keys into one regex (outside the loop), and match that. You can use named captures or `(?{...})` code blocks to identify which key actually matched. 2nd update: To make my point a bit clearer, I've written a small script which generates 5k random search phrases. `use strict; use warnings; use 5.010; use List::Util qw(max); my @letters = ('a' ... 'z', (' ') x 5); my %words = map { join('', map { $letters[int rand(@letters)] } 1 .. max(5, rand(50) +)) => $_ } 1..5_000; my @keys_sorted_by_length_desc = sort { length $b <=> length $a } key +s %words; my $re = join '\|', @keys_sorted_by_length_desc; $re = qr{($re)}i; while (<>) { chomp; if ($_ =~ $re) { say "found match $1 in '$_', $words{$1}"; } }` [download] I've applied that script to roughly 500k lines from files in /usr/share/dict/. It takes about 0.6s to generate the random phrases and the regex, and 1.5s for precessing the 500k lines. I don't know if that's fast enough for you, I wouldn't describe it as "scales horribly" anymore :-)	[reply] [d/l] [select]
Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) (words) by tye (Sage) on May 14, 2010 at 18:05 UTC
You are forcing Perl to recompile each regex over and over. A simple improvement would be: `# ... my %regex; @regex{ keys %hash }= map { qr/(?<!\S)\Q$_\E(?!\S)/i } values %hash; foreach my $string ( @strings ) { foreach my $key ( @keys_sorted_by_length_desc ) { if( $string =~ $regex{$key} ) { print "Found '$key' in '$string'\n"; last; } } }` [download] But you might consider matching patterns of words rather than strings: #!/usr/bin/perl -w use strict; my @strings = ( 'Some string about this long or so, maybe this long', 'I like pizza this long or so, maybe this long', 'this long or so, maybe this French Fries long', 'This Sugar Rush Rocks. maybe this do not stop the clock.', ); my @repl = ( [qw< Sugar Rush Rocks >], 'whatever', 'long', 'itsgood', [qw< this long >], 'ilikeit', [qw< maybe this >], 'itsokay', [qw< Some String >], 'loooveit', ); my %repl; while( @repl ) { my $word= shift @repl; my $repl= shift @repl; my $len; if( ! ref $word ) { $repl= [ $repl, length($word) ]; } else { my $len= length join ' ', @$word; my $first= shift @$word; $repl= [ $repl, map( lc $_, @$word ), $len ]; $word= $first; } push @{ $repl{ lc $word } }, $repl; } for my $list ( values %repl ) { @$list= sort { $b->[-1] <=> $a->[-1] } @$list; pop @$_ for @$list; } STRING: foreach my $string ( @strings ) { my @words= $string =~ /(\S+)/g; my $i= 0; while( $i < @words ) { my $word= lc $words[$i]; next if ! $repl{$word}; for my $list ( @{ $repl{$word} } ) { my( $repl, @next )= @$list; next if grep $next[$_] ne lc( $words[$i+1+$_] \|\| '' ), 0.. +$#next; print "Found '$word @next' in '$string'\n"; next STRING; } } continue { $i++; } } [download] - tye	[reply] [d/l] [select]
Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by thundergnat (Deacon) on May 14, 2010 at 17:12 UTC
Use index instead of a regex. Here's a benchmark (print is commented out to remove the effects from the benchmark.) use Benchmark qw(cmpthese); my @strings = ( 'Some string about this long or so, maybe this long', 'I like pizza this long or so, maybe this long', 'this long or so, maybe this French Fries long', 'This Sugar Rush Rocks. maybe this do not stop the clock.', ); my %hash = ( 'Sugar Rush Rocks' => 'whatever', 'long' => 'itsgood', 'this long' => 'ilikeit', 'maybe this' => 'itsokay', 'Some String' => 'loooveit' ); my @keys_sorted_by_length_desc = sort { length $b <=> length $a } keys %hash; cmpthese( -5, { 'Regex' => \&use_regex, 'Index' => \&use_index, } ); sub use_regex { foreach my $string (@strings) { foreach my $key (@keys_sorted_by_length_desc) { my $key_re = quotemeta($key); if ( $string =~ /(:?\A\|\s)($key_re)\s*/i ) { #print "Found '$key' in '$string'\n"; last; } } } } sub use_index { foreach my $string (@strings) { foreach my $key (@keys_sorted_by_length_desc) { my $lcstring = lc $string; my $lckey = lc $key; if ( ( index $lcstring, $lckey ) > -1 ) { #print "Found '$key' in '$string'\n"; last; } } } } __END__ Rate Regex Index Regex 7539/s -- -86% Index 52227/s 593% -- [download] Update: move generating the sorted list out of the subroutines. Even better. Rate Regex Index Regex 8311/s -- -92% Index 101399/s 1120% --	[reply] [d/l]
Re^2: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by ikegami (Patriarch) on May 14, 2010 at 18:52 UTC
You're not comparing regex vs index, you're comparing a buggy and inefficient implementation of regex with a non-equivalent implementation using index. "(:?" should be "(?:" You compile the same regex repeatedly. You use captures where none are needed. The \s* is completely useless. You check if they key is the start of a word string with the regex solution, but not with the index solution. Fixed: use strict; use warnings; use Benchmark qw(cmpthese); my @strings = ( 'Some string about this long or so, maybe this long', 'I like pizza this long or so, maybe this long', 'this long or so, maybe this French Fries long', 'This Sugar Rush Rocks. maybe this do not stop the clock.', ); my %hash = ( 'Sugar Rush Rocks' => 'whatever', 'long' => 'itsgood', 'this long' => 'ilikeit', 'maybe this' => 'itsokay', 'Some String' => 'loooveit' ); my @keys_sorted_by_length_desc = sort { length $b <=> length $a } keys %hash; cmpthese(-2, { Regex => \&use_regex, Index => \&use_index, Index2 => \&use_index2, }); sub use_regex { my %re_keys = map { $_ => qr/\b\L\Q$_\E\b/ } @keys_sorted_by_lengt +h_desc; for my $string (@strings) { my $s = lc($string); for my $key (@keys_sorted_by_length_desc) { if ( $s =~ $re_keys{$key} ) { #print "Found '$key' in '$string'\n"; last; } } } } sub use_index { my @lc_keys = map lc, @keys_sorted_by_length_desc; for my $string (@strings) { my $s = lc($string); for my $key (@keys_sorted_by_length_desc) { my $lc_key = lc($key); my $spos = index($s, $lc_key); if ($spos >= 0 && ( $spos == 0 \|\| substr($s, $spos-1, 1) = +~ /\W/ )) { my $epos = $spos + length($lc_key); if ($epos == length($s) \|\| substr($s, $epos, 1) =~ /\W +/) { #print "Found '$key' in '$string'\n"; last; } } } } } sub use_index2 { for my $string (@strings) { ( my $s = " \L$string " ) =~ s/\W/ /g; for my $key (@keys_sorted_by_length_desc) { my $lc_key = " \L$key "; if (index($s, $lc_key) >= 0) { #print "Found '$key' in '$string'\n"; last; } } } } [download] Results: `Rate Regex Index2 Index Regex 23206/s -- -33% -56% Index2 34824/s 50% -- -35% Index 53339/s 130% 53% --` [download] Update: Fixed many many problems.	[reply] [d/l] [select]
Re^3: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by thundergnat (Deacon) on May 14, 2010 at 19:36 UTC
I don't disagree with your commentary about the crappy regex. It isn't mine though, it is what the OP had. And, while it's true that my solution has some serious shortcomings, your solution has its own set of problems. The OP said that he needs the $key value later to get values out of the hash. If it is a precompiled regex, he can't use it. You'll probably need to compile a hash of key, regex pairs, but that will raise issues of ordering again. Also, $s contains the lc string, I think you meant $_ in the print line. That's just a typo though.	[reply]
Re^4: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by ikegami (Patriarch) on May 14, 2010 at 22:53 UTC
Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by repellent (Priest) on May 14, 2010 at 17:30 UTC
`use Regexp::Assemble; use List::Util qw(reduce); # XXX why have a hash if its values are not used? my @keys = map { quotemeta } keys %hash; my $key_re = Regexp::Assemble->new->add(@keys)->re; for my $string (@strings) { my $match = reduce { length($a) > length($b) ? $a : $b } ($string =~ /\b($key_re)\b/gi); print "Found '$match' in '$string'\n" if defined $match; }` [download] Update1: Now handles multiple keys in same string. Update2: Overlapping keys are not handled well, due to the left-to-right nature of the regex match. If this assumption can be made, hopefully this approach can improve performance. YMMV.	[reply] [d/l]
Re^2: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by JavaFan (Canon) on May 14, 2010 at 17:35 UTC
This does not do what the title suggest the problem requires: finding the longest key first. If you have keys "Short" and "Longer Key", and the string is "This is a Short Key and a Longer Key", your solution will report "Short" (left most), while the OP original solution reports "Longer Key". Now, it may be that the OP is satisfied by that - but the title suggests he won't.	[reply]
Re^3: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by repellent (Priest) on May 14, 2010 at 17:58 UTC
Right. Code has been updated.	[reply]
Re^4: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by JavaFan (Canon) on May 14, 2010 at 18:01 UTC
Re^5: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by repellent (Priest) on May 14, 2010 at 19:04 UTC
Re^2: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by Anonymous Monk on May 14, 2010 at 18:03 UTC
I didn't use the hash in the example, but I do want to use it! `If (/..../) { $hash_value = $hash{$1???} }` [download]	[reply] [d/l]
Re: Matching Many Strings against a Large List of Hash Keys (case insensitively, longest key first) by JavaFan (Canon) on May 14, 2010 at 17:02 UTC
The more you know about the data, the better. For instance, if most strings will not match, you could do some preprocessing: for each string, split it on whitespace. Have a hash of all the first words of the keys of %hash; if none of the words in the string are in the hash, there cannot be a match. Obviously, this will not do you any good if all strings will have a match. Also, is it important to check the keys in a specific order? Or are you satisfied if any key matches? And why do you have %hash if you aren't using its values?	[reply]