Problems counting regex matches

elcilorien has asked for the wisdom of the Perl Monks concerning the following question:

I'm having difficulties counting the number of matches to some regular expressions. I know you can put the regex in list context, and then convert the list to a scalar (http://stackoverflow.com/questions/1849329/is-there-a-perl-shortcut-to-count-the-number-of-matches-in-a-string), for example using the goatse operator, =()=, but this doesn't seem to be working with my particular regular expression.

In the example below, I'm searching a string to see if either revenue(s), sales or growth occur within three words of the word currency or the phrase "foreign exchange". I cannibalized this regex from a website that gives an example of implementing "near" in Perl: http://www.regular-expressions.info/near.html.

The problem that I'm running into is that I cannot for the life of me accurately count the number of matches of my regex. For example, when I test a text file containing only the words

foreign exchange revenue
currency revenue

I find EIGHT matches. My own intuition and a test run in RegexBuddy show only TWO matches. I don't get any errors from Perl. But when I output my matches to a list and print them, these are the "matches" I get: (each match is in between *'s)

1 **
2 **
3 *foreign exchange*
4 *revenue*
5 **
6 **
7 *currency*
8 *revenue*

I'm getting several empty matches, and then some other matches that don't even match the whole phrase that should be matched. I can count simple regexes just fine, but somehow my convoluted "near" expression is messing things up. I keep trying to fiddle with the regex, but nothing I've tried has worked. I am willing to admit that I am not an expert programmer, and this is beyond my abilities at this point.

use strict;
use warnings;

my $FX_growth;

    if ($text=~/\b(?:(revenues?|sales|growth)\W+(?:\w+\W+){0,4}?(currency|foreign\Wexchange)|(currency|foreign\Wexchange)\W+(?:\w+\W+){0,4}?(revenues?|sales|growth))\b/i)
        {
            $FX_growth =()= $text =~ /\b(?:(revenues?|sales|growth)\W+(?:\w+\W+){0,4}?(currency|foreign\Wexchange)|(currency|foreign\Wexchange)\W+(?:\w+\W+){0,4}?(revenues?|sales|growth))\b/gi;
        } else {
            $FX_growth=0;
            }

Re: Problems counting regex matches by Eily (Monsignor) on Jan 15, 2014 at 17:51 UTC

First piece of advice: the /x modifier lets you add whitespace and comments to make long regexes easier to read. Second, instead of /regex/ you can write m<regex> (see perlop), or you can save a regex in a variable using qr:

use Data::Dumper;

my $regex = qr<
    \b
    (?: # Non capturing group
        ## Case 1: currency|foreign exchange comes second
        (revenues?|sales|growth) # group 1
        \W+ (?:\w+\W+){0,4}? # Non capturing group
        (currency|foreign\Wexchange) # group 2
    |
        ## Case 2: currency|foreign exchange comes first
        (currency|foreign\Wexchange) # group 3
        \W+ (?:\w+\W+){0,4}? # Non capturing group
        (revenues?|sales|growth) # group 4
    )
    \b
>x;

my $text = <<END_OF_TEXT;
foreign exchange revenue
currency revenue
END_OF_TEXT

print Dumper [ $text =~ /$regex/gi ];

Now, there are 4 groups, and you get four times the number of matches. That is simply because of what a /g match returns in list context (according to Regexp Quote-Like Operators):

    In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.

So your 8-element list is actually the 4 groups for the first match, followed by the 4 groups for the second match. In both matches you have strings in groups 3 and 4, because (currency|foreign exchange) comes first.

If you just want to count the matches, without getting the word that matched, turn all capturing parentheses (text) into non-capturing ones (?:text): try my example first as is, and then again after modifying the parentheses.

If you want to know which word matched, the most beginner-friendly way I can think of is to loop over successive matches of the regex and read either group 1 or group 4 (the two groups that hold revenue(s), sales or growth). Do know that this won't work for all cases though: if you have two matching words in the neighbourhood of the same (currency|foreign exchange), just one will be counted. For example, in "Currency revenue sales growth" you'll just get "revenue", because the next match attempt will start after "revenue" and "currency" won't be visible anymore.
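
Both suggestions can be sketched roughly as follows, using the two-line sample text from the question. This is only an illustration, not Eily's code: the variable names ($count_re, $capture_re, @pairs) are made up here, and the // operator assumes Perl 5.10 or later. Going by the group comments in the regex above, groups 1 and 4 hold revenue(s)/sales/growth and groups 2 and 3 hold the currency term.

```perl
use strict;
use warnings;

my $text = "foreign exchange revenue\ncurrency revenue\n";

# Counting only: every group is non-capturing, so a /g match in list
# context returns one element per match instead of four.
my $count_re = qr<
    \b
    (?:
        (?:revenues?|sales|growth) \W+ (?:\w+\W+){0,4}? (?:currency|foreign\Wexchange)
    |
        (?:currency|foreign\Wexchange) \W+ (?:\w+\W+){0,4}? (?:revenues?|sales|growth)
    )
    \b
>xi;

my $count = () = $text =~ /$count_re/g;
print "matches: $count\n";

# Which words matched: keep the captures and walk the matches one by one.
my $capture_re = qr<
    \b
    (?:
        (revenues?|sales|growth)       # group 1
        \W+ (?:\w+\W+){0,4}?
        (currency|foreign\Wexchange)   # group 2
    |
        (currency|foreign\Wexchange)   # group 3
        \W+ (?:\w+\W+){0,4}?
        (revenues?|sales|growth)       # group 4
    )
    \b
>xi;

my @pairs;
while ($text =~ /$capture_re/g) {
    my $keyword  = $1 // $4;   # whichever alternative took part in this match
    my $currency = $2 // $3;
    push @pairs, "$currency -> $keyword";
}
print "$_\n" for @pairs;
```

On the sample text this should report two matches and the pairs "foreign exchange -> revenue" and "currency -> revenue". The matches do not overlap, so the limitation described above still applies.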

Re^2: Problems counting regex matches by AnomalousMonk (Archbishop) on Jan 15, 2014 at 18:50 UTC

Just to clarify: when used in an alternation in list context, a capture group always returns something, whether it matches or not. If it does not match, undef is returned. One way to filter out these `undef`s is with a grep.

>perl -wMstrict -MData::Dump -le
"my $s = 'AAA BBB CCC DDD AAA BBB';
 ;;
 my @captures = $s =~ m{ (AAA) | (BBB) | (XXX) | (YYY) }xmsg;
 dd \@captures
 ;;
 my @matches = grep defined, $s =~ m{ (AAA) | (BBB) | (XXX) | (YYY) }xmsg;
 dd \@matches;
"
[
  "AAA",
  undef,
  undef,
  undef,
  undef,
  "BBB",
  undef,
  undef,
  "AAA",
  undef,
  undef,
  undef,
  undef,
  "BBB",
  undef,
  undef,
]
["AAA", "BBB", "AAA", "BBB"]

Update: Another way to capture only matches is with the "branch reset" extended pattern (see `"(?|pattern)"` in Extended Patterns in perlre), available with Perl version 5.10+.

>perl -wMstrict -MData::Dump -le
"use 5.010;
 ;;
 my $s = 'AAA BBB CCC DDD AAA BBB';
 ;;
 my @matches = $s =~ m{ (?| (AAA) | (BBB) | (XXX) | (YYY) ) }xmsg;
 dd \@matches
"
["AAA", "BBB", "AAA", "BBB"]

Re^2: Problems counting regex matches by AnomalousMonk (Archbishop) on Jan 16, 2014 at 00:05 UTC

    ... this won't work for all cases though, if you have two matching words in the neighbourhood of the same (currency|foreign exchange), just one will be counted. For example in "Currency revenue sales growth", you'll just get "revenue" ...

The following works for overlapping matches. It also needs 5.10+ because in addition to `(?|pattern)`, it uses `(*FAIL)` from the Special Backtracking Control Verbs introduced in that version. The variation that only counts occurrences may be a little faster.

Read more... (1429 Bytes)

Output:

Read more... (674 Bytes)
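
The collapsed code is not reproduced here, so what follows is only a minimal sketch of the general `(*FAIL)` idea and should not be read as AnomalousMonk's script. It assumes the currency term comes first, as in the quoted example, and the names ($text, @found) are invented for illustration. A `(?{ ... })` code block records each candidate, then `(*FAIL)` rejects it so the engine backtracks and tries the next possibility.

```perl
use strict;
use warnings;
use 5.010;    # (*FAIL) needs Perl 5.10 or later

my $text = 'Currency revenue sales growth';

my @found;
$text =~ m{
    \b (?: currency | foreign\W exchange )
    \W+ (?: \w+ \W+ ){0,4}?
    ( revenues? | sales | growth ) \b
    (?{ push @found, $1 })   # record this candidate
    (*FAIL)                  # then force the engine to backtrack
}xi;

say scalar(@found), " hits: @found";
```

The match as a whole always fails, so only the side effect in the code block matters. For the sample string this should collect revenue, sales and growth, where a plain /g loop stops after revenue.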

Re: Problems counting regex matches by AnomalousMonk (Archbishop) on Jan 15, 2014 at 19:39 UTC

Just a stylistic note in addition to those of Eily above: it is uselessly verbose to test for a match against a regex and then conditionally count the number of matches by doing another match against the exact same regex (or else setting the count to zero), as in the OP's code. Just counting valid matches (i.e., somehow avoiding those extraneous captures) is enough: if there is no match at all, the count will be 0.

>perl -wMstrict -le
"use 5.010;
 ;;
 my $regex = qr{ (?| (AAA) | (BBB) | (XXX) | (YYY) ) }xms;
 ;;
 my $s = 'AAA BBB CCC DDD AAA BBB';
 my $n_matches =()= $s =~ m{ $regex }xmsg;
 print qq{$n_matches matches};
 ;;
 $s = 'foo bar baz boff';
 $n_matches =()= $s =~ m{ $regex }xmsg;
 print qq{$n_matches matches};
"
4 matches
0 matches

Re: Problems counting regex matches by InfiniteSilence (Curate) on Jan 15, 2014 at 22:01 UTC

    "...this is beyond my abilities at this point..."

Scale the problem back to your actual abilities and start from there. If you are confused by complex regular expressions (as I often am), break the problem down into simpler ones until you are more familiar with them. Note that this example only finds words that are near your search term and appear after it:

use strict;

=pod

...if either revenue(s), sales or growth occur within three words of the word currency or the phrase "foreign exchange."

=cut

my @sources = ('foreign exchange','currency');
my $sourceData = <<EOF;
foreign exchange revenue wordiness happycat smiles
currency revenue world cat blue runny
a nice happy foreign exchange said that revenues would be up
the day I last visited my foreign exchange they said revenues were good
wow currency makes good growth
when currency came revenues dipped
EOF

my %searchvector = map {$_=>$_} qw|revenue sales growth|;
my $cntFound = 0;

for (@sources) {
    while($sourceData=~m/$_\s+(\w+)\s(\w+)?\s(\w+)?\s/g) #does the word appear?
    {
        # is anything in the searchvector in the found words?
        for ($1,$2,$3) {
            my $r = $_;
            $r=~s/revenues/revenue/;
            if($searchvector{$r}){++$cntFound};
        }
    }
}

print qq|Total times search (| . (join ' ', (values %searchvector)) . ') found: ' . $cntFound . qq|\n|;

1;

Output:

Total times search (growth sales revenue) found: 6

Celebrate Intellectual Diversity
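
In the same "break it down" spirit, here is a rough sketch that avoids the big regex altogether and looks on both sides of the search term. It is not part of InfiniteSilence's post: the helper name count_near and its $window parameter are hypothetical, and it counts every term/keyword pair whose word positions differ by at most $window, so its totals can be higher than those of a non-overlapping regex match.

```perl
use strict;
use warnings;

# Hypothetical helper: counts (currency term, keyword) pairs that lie at most
# $window word positions apart, in either order.
sub count_near {
    my ($text, $window) = @_;

    # Collapse the two-word phrase into one token so distances are in words.
    (my $flat = lc $text) =~ s/\bforeign\W+exchange\b/foreign_exchange/g;
    my @words = $flat =~ /\w+/g;

    my %is_term    = map { $_ => 1 } qw(currency foreign_exchange);
    my %is_keyword = map { $_ => 1 } qw(revenue revenues sales growth);

    my $count = 0;
    for my $i (grep { $is_term{ $words[$_] } } 0 .. $#words) {
        # Clamp the window to the bounds of the word list.
        my $lo = $i - $window;
        $lo = 0 if $lo < 0;
        my $hi = $i + $window;
        $hi = $#words if $hi > $#words;
        $count += grep { $is_keyword{ $words[$_] } } $lo .. $hi;
    }
    return $count;
}

print count_near('sales rose as currency weakened', 3), "\n";
print count_near('sales rose as currency weakened', 2), "\n";
```

With a window of 3 the word "sales" is close enough to "currency" to count. With a window of 2 it is not.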