Count multiple pattern matches

johnnywang has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Count multiple pattern matches by Zaxo (Archbishop) on Dec 07, 2004 at 02:04 UTC
One way is almost what you have, but with some changes, `use strict; my (@keywords, %keyword)=qw/foo bar 12345 abcd/; my ($string, %result) = "foobarfoo1234523423412345abcdefsadfabc";` [download] Precompile the regexen, `@keyword{@keywords} = map {qr/\Q$_\E/} @keywords;` Now get the count directly without any named temporary, `$result{$_} = () = $string =~ $keyword{$_} for (keys %keyword);` [download] That's no big change over what you have, but uses some idiomatic optimizations. Another way is to count within the big regex you mention. You can do that with a code construction in the re, `my @regexen = map { qr/(?:\Q$_\E(?{$result{$_}++}))/ } @keywords; my $re = do { local $" = '\|'; qr/@regexen/; };` [download] I've used my favorite tricky way of getting alternation into an array there (qr// is an interpolating quote operator). `$string =~ /$re/g; print "$_: $result{$_}\n" for @keywords;` [download] There, the regex engine should only evaluate the code part if the text has matched, and then restart the regex at pos for the next match. Untested, Update: Perl qr// dosn't seem to like running the `(?{$result{$_}++})` bit. I'm not sure why. Anybody know? A third way is to munch through the string with index for each word you want to match. It may be worthwhile to study your text before running the regex matches on it. Benchmark your different approaches, chances are that each will be best for some cases. After Compline, Zaxo	[reply] [d/l] [select]
Re^2: Count multiple pattern matches by zejames (Hermit) on Dec 07, 2004 at 10:18 UTC
I've had a look at your pretty solution (with `qr`). However, it currently does not work, for several reasons. The first in the one you mention : it does not compile, because of `qr` and `(?{...})` block. That is explained in perlre, about the `(?{...})` block : For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous use re 'eval' pragma has been used (see re), or the variables contain results of qr// operator (see perlop/"qr/STRING/imosx"). In short, that is to prevent a external variable containing such a block to be executed in the regex engine. The solution is described in the doc : `use re 'eval'` However, some other thing does not work yet : when the regex `qr/(?:(\Q$_\E))(?{$result{$_}++})/` is matched, that does not set the `%result` hash entry, because at execution time, `$_` is not what you expect. If you look at the regex created : `print $re; __DATA__ (?-xism:(?-xism:(?:(foo))(?{$result{$_}++}))\|(?-xism:(?:(bar))(?{$resu +lt{$_}++}) )\|(?-xism:(?:(12345))(?{$result{$_}++}))\|(?-xism:(?:(abcd))(?{$result{ +$_}++})))` [download] you'll see that `$_` variable is not replaced by the keyword values. To my understanding, it is because `(?{...})` contains code that will be interpreted later, so perl does not interpret in at that time, leaving it unmodified. So, to do what we want here, I use `(...)` to catch a execution time the match and set the correct hash entry. There is a final problem : the name of the variable. `my $data = "foofdsfdsbar"; my @matches = $data =~ m/(foo)(?{print "($1,$2,$+)"})\|(bar)(?{print "($1,$2,$+)" +})/g; __DATA__ (foo,,foo)(,bar,bar)` [download] That is to say : `$+` is the only one that surely contains the last pattern matched (BTW, I know about performance penalty when using it, but I won't care for now ;) So, let's code it ! `#!/usr/bin/perl use strict; use re 'eval'; my @keywords = qw/foo bar 12345 abcd/; my ($string) = "foobarfoo1234523423412345abcdefsadfabc"; our %result; my @regexen = map { qr/(?:(\Q$_\E))(?{$result{$+}++})/ } @keywords; my $re = do { local $" = '\|'; qr/@regexen/; }; my @match = $string =~ /$re/g; print "$_: $result{$_}\n" for @keywords;` [download] -- zejames	[reply] [d/l] [select]
Re: Count multiple pattern matches by tilly (Archbishop) on Dec 07, 2004 at 02:50 UTC
Use a capturing subgroup. `use strict; my @keywords=qw/foo bar 12345 abcd/; my $string = "foobarfoo1234523423412345abcdefsadfabc"; my %result; my $pattern = join "\|", @keywords; while ($string =~ /($pattern)/ig) { $result{$1}++; } foreach my $s (keys %result){ print "$s=>$result{$s}\n"; }` [download] As an efficiency note I'm not sure whether the pattern above keeps on getting recompiled. You may wish to benchmark a few variations. It is also possible that it is faster to match many times than it is to do one big complex match - I've certainly seen that happen in the past. And you should definitely look at Regexp::PreSuf. As a bug note, if your strings are "foo", "bar", and "foobar", the original approach will correctly spot one of each in "foobar". This approach will not. (Depending on order it might spot "foobar" or it might spot "foo" and "bar".) This gets even more complicated if you're looking at things like "bazfoo" and "foobar".	[reply] [d/l]
Re: Count multiple pattern matches by bobf (Monsignor) on Dec 07, 2004 at 05:58 UTC
If the 'pretty long' string is 'long enough' and you have a lot of keywords, you might be better off using a non-regex solution. The example below creates an index of $string, so finding exact matches is fast. Using very short substrings for the index will not be very efficient, though, so if your keywords are only a few characters a regex might be better. This example uses a snippet of DNA sequence for $string and an arbitrary substring length of 3 (but that could be set based on the length of the shortest element in @keywords instead). The location of all the matches are stored in %matches, but you could speed things up by just incrementing a counter if you didn't need them for anything later. use strict; use warnings; my $string = 'ccaaactcagtggggtgaatggggcttctctgtgctctgatagcttccctaccctt +tcccttctccagctcccgtcccttctgactgtgagcagccccctcctctccactgttcccctcctgttg +tcagaggagggcccagctgaggcagggactggaccaccggctggggtgtccctaggggtcttgggtggc +tggcagtagtggagcctggggctgagaggggaagcaaaataagattgtcctccaacttagccatcctca +ggcctgctggggctatttaactggctgggcctgcatggcgacagggcccctacagcctccctgggaaca +aggggtgaagggttcagggggaagggggtcacagagtgatggagaaacctcttgagaacaaactaggct +ccctcatgctggagtccaaggctgagtacctcccttctctgaaacagagcaacaaccccactcccaccc +cgagtctgtc'; my @keywords = qw( ttc gctg ccaac ggggct ccc ); my $substrlen = 3; # or use length of shortest element in @keywords # create an index of all substrings of length $string my %substrings; for( my $i = 0; $i <= length($string) - $substrlen; $i++ ) { my $substring = substr( $string, $i, $substrlen ); push( @{ $substrings{$substring}{hits} }, $i ); } # search the index for all elements of @keywords my %matches; foreach my $keyword ( @keywords ) { my $subkey = substr( $keyword, 0, $substrlen ); if( exists $substrings{$subkey} ) { foreach my $hit ( @{ $substrings{$subkey}{hits} } ) { if( $keyword eq substr($string, $hit, length($keyword)) ) { push( @{ $matches{$keyword} }, $hit ); } } } } foreach my $keyword ( keys %matches ) { print "Found $keyword ", scalar @{ $matches{$keyword} }; printf " time%s at position%s: ", scalar @{ $matches{$keyword} } > 1 ? 's' : '', scalar @{ $matches{$keyword} } > 1 ? 's' : ''; print join( ', ', @{ $matches{$keyword} } ), "\n"; } OUTPUT Found gctg 8 times at positions: 140, 164, 192, 214, 268, 286, 408, 42 +1 Found ttc 8 times at positions: 25, 44, 55, 60, 78, 111, 344, 435 Found ccaac 1 time at position: 245 Found ccc 22 times at positions: 46, 51, 57, 70, 75, 94, 95, 96, 113, +114, 136, 174, 309, 310, 321, 401, 432, 456, 457, 463, 467, 468 Found ggggct 3 times at positions: 20, 211, 271 [download] I just thought a different approach might be interesting. YMMV. I'm sure there is a break even point between creating a regex with a lot of alternation and the time spent creating the index. I'd recommend benchmarking a few methods with some of your actual data to see what works best.	[reply] [d/l]
Re: Count multiple pattern matches by Mr. Muskrat (Canon) on Dec 07, 2004 at 02:39 UTC
It's probably not the best way to do it but this is what sprang to mind when I saw this post. It's also very similar to what you already have. `use strict; my @keywords = qw/foo bar 12345 abcd/; my $string = "foobarfoo1234523423412345abcdefsadfabc"; my $re = join('\|',@keywords); my %result; $result{$1}++ while ($string =~ /($re)/ig); foreach my $s (keys %result){ print "$s=>$result{$s}\n"; }` [download]	[reply] [d/l]
Re: Count multiple pattern matches by Light Elf (Initiate) on Dec 07, 2004 at 10:18 UTC
removing previous keywords from a string should make your algoritm faster. `... foreach my $k (@keywords){ my $matches_count = ( ($string =~ s/$k//ig ) ); ...` [download]	[reply] [d/l]
Re^2: Count multiple pattern matches by calin (Deacon) on Dec 08, 2004 at 15:22 UTC
This won't work correctly if keywords overlap. For example, consider the string `foops`. This actually matches both `foo` and `oops`. But if you remove `foo` in the first iteration, the remaining string (`ps`) won't match the keyword `oops`, thus reducing the match count for `oops` from 1 (the correct result) to 0.	[reply] [d/l] [select]
Re: Count multiple pattern matches by kappa (Chaplain) on Dec 07, 2004 at 15:01 UTC
Try Regexp::List. It converts `qw/foobar fooxar foozap fooza/` to `qr/foo(?:[bx]ar\|zap?)/` using Trie structure so that the resulting regex is much more efficient than naive `/foobar\|fooxar\|.../`. And tillys approach to counting each word by capturing the match sounds great. --kap	[reply] [d/l] [select]
Re: Count multiple pattern matches by melora (Scribe) on Dec 07, 2004 at 14:12 UTC
(sorry, just not feeling up to coding at the moment) How about this idea: put the keywords into a hash, with a zero as the hash value for each keyword. Then for each token in the input string, increment the hash value for that token (check to see whether it's in the hash first?). The non-zero hash values are then counts of how many times each keyword appears in the input string. That leaves the hunting for the tokens to the hash implementation... but you still have to look for the non-zero hash elements. Something like foreach keyword in the hash, if the value is greater than zero, spit out the keyword and the value (then set it to zero again, if this is all inside a loop)... ? The hunting-through-an-array thing is one of my old C habits that I'm trying to break. Please don't scream, just gently correct me if I'm being stupid.	[reply]
Re: Count multiple pattern matches by pijll (Beadle) on Dec 07, 2004 at 16:39 UTC
Have you tried to insert a `study $string;` outside the loop, before you do the first regex match? This should speed up the matches, especially if `$string` is very long. `perldoc -f study` [download] for more information.	[reply] [d/l]


Think about Loose Coupling
	PerlMonks