abhishes has asked for the wisdom of the Perl Monks concerning the following question:

Hello All,

I want to count the number of occurrences of the word "the" in the string "The quick brown fox jumps over the lazy dog."
I want to do this using regular expressions. I wrote the following program which works correctly for characters but is producing weird results for words.
    use warnings;
    use strict;
    my $str = "The quick brown fox jumps over the lazy dog";
    my $count = ($str =~ tr/(the)//);
    print "$count\n";
The answer printed is 6!!! But it should be 1 (or 2 if I can make tr/// ignore case).
regards,
Abhishek.

Re: counting the number of occurrances of a word using regex
by Ovid (Cardinal) on Dec 03, 2002 at 06:10 UTC

    And in the spirit of TIMTOWTDI:

        my $string = "The quick brown fox jumps over the lazy dog.";
        my $count = 0;
        $count++ while $string =~ /(the)/gi;

    This is useful if, for example, you want to operate on a particular occurrence of a match. For example, if you wanted to grab the second occurrence of a particular item in every line of a log file:

        while (<IN_FILE>) {
            my $count = 0;
            MATCH: while (/(\Q$some_string\E)/gi) {
                $count++;
                if ( 2 == $count ) {
                    # do something
                    last MATCH;
                }
            }
        }
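As a rough, self-contained sketch of the same idea, the snippet below records where the second occurrence starts in each line. The sample lines and the search term are made up for illustration (they stand in for IN_FILE and $some_string), and using $-[0] for the match offset is one choice among several.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for lines read from a log file.
my @lines = (
    "error: disk full; error: retrying",
    "error: only one here",
);
my $some_string = "error";    # hypothetical search term

my @second_offsets;
for my $line (@lines) {
    my $count = 0;
    while ($line =~ /\Q$some_string\E/gi) {
        next unless ++$count == 2;
        # $-[0] holds the offset where the most recent match started.
        push @second_offsets, $-[0];
        last;
    }
}
print "@second_offsets\n";    # prints "18"
```

Only the first line contains a second occurrence, so only one offset is collected.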

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

      There's no point in putting parentheses around the search pattern, is there? I find it remarkable that 3 out of the 5 followups so far put parentheses around the search pattern, without using $1.

      Abigail

        You're right. I was just a-cuttin' and a-pastin' and not paying attention. That's a very bad habit of mine. Thanks for the reminder.

        Cheers,
        Ovid


Re: counting the number of occurrances of a word using regex
by BrowserUk (Patriarch) on Dec 03, 2002 at 06:06 UTC

    Your problem is that tr/// is not a regex operator in the true sense. It always operates character by character.

    What you have actually asked perl to do with the line

        my $count = ($str =~ tr/(the)//);

    is:

    • Inspect the variable $str.
    • Look for any of the characters '(', 't', 'h', 'e', ')' and, because the replacement list is empty, just count them without changing the string.
    • Return a count of the number of characters found in $str that were in the searchlist.
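To make the character-by-character behaviour concrete, here is a small sketch using the string from the question. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $str = "The quick brown fox jumps over the lazy dog";

# Counts every character that is one of ( t h e ), not the word "the".
my $chars = ($str =~ tr/(the)//);

# Dropping the parentheses gives the same total here, because the
# string contains no '(' or ')' characters.
my $no_parens = ($str =~ tr/the//);

# What tr/// is good at: counting single characters, here spaces.
my $spaces = ($str =~ tr/ //);

print "$chars $no_parens $spaces\n";    # prints "6 6 8"
```

The capital 'T' in "The" is not counted, since tr/// has no case-insensitive mode and 'T' is not in the searchlist.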

    One way to count the occurrences of a given word in a string would be to use the m// operator with the /g option and force a list context as in

        my $str = "The quick brown fox jumps over the lazy dog";
        my $count = () = $str =~ m/(the)/ig;
        print $count;

    will print 2.

    However, that is still not quite right: use it on the string 'There are three theatres in the town' and it will print 3! This is because the regex /(the)/ will also match the first 3 chars of 'There' and 'theatres'. To ensure that you only match whole words, you can bracket the word with \b ('word boundary' zero-width assertions) like this

        my $str = "There are three theatres in the town";
        my $count = () = $str =~ m/(\bthe\b)/ig;
        print $count;

    which will correctly print 1, and

        my $str = "The quick brown fox jumps over the lazy dog";
        my $count = () = $str =~ m/(\bthe\b)/ig;
        print $count;

    which will correctly print 2. (Note the /i modifier, which makes the match case-independent.)
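The effect of \b can be seen side by side with a short sketch that runs both patterns over the two strings used above. The @results array is just scaffolding for this illustration.

```perl
use strict;
use warnings;

my @samples = (
    "The quick brown fox jumps over the lazy dog",
    "There are three theatres in the town",
);

my @results;
for my $str (@samples) {
    my $loose = () = $str =~ /the/gi;        # any "the" substring
    my $whole = () = $str =~ /\bthe\b/gi;    # whole words only
    push @results, "$loose $whole";
}
print "$_\n" for @results;    # prints "2 2" then "3 1"
```

Note that 'three' never matches either pattern: its first three letters are 't', 'h', 'r'.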


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: counting the number of occurrances of a word using regex
by graff (Chancellor) on Dec 03, 2002 at 05:40 UTC
    The "tr///" operator only works on individual characters, not on strings. To do what you want, you need to evaluate the "m//" operator in list context (assigning to an array does that), and then count the elements in the resulting array, thusly:
        my $str = "The quick brown fox jumped over the lazy dog";
        my @the = ( $str =~ /\bthe\b/gi );
        print scalar @the, $/;
    That prints "2".

    update: It would be clearer (I hope) to say that "tr///" treats every character in the left-hand side as a member of a character class; it cannot treat any sequence of characters as a contiguous string to be matched; only the "m//" operator (and the "s///" operator) can do that. Look carefully at the "perlop" man page for more complete descriptions of these three operators.
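One way to check the character-class reading is to compare tr/// against a regex that uses an explicit bracketed class. This is only a sketch, with illustrative variable names, using the "jumped" variant of the sentence from this reply.

```perl
use strict;
use warnings;

my $str = "The quick brown fox jumped over the lazy dog";

# tr/// counts single characters drawn from its searchlist...
my $by_tr = ($str =~ tr/the//);

# ...which matches what m// finds with a bracketed character class.
my $by_class = () = $str =~ /[the]/g;

# Matching the contiguous word needs m// with the literal sequence.
my $words = () = $str =~ /\bthe\b/gi;

print "$by_tr $by_class $words\n";    # prints "7 7 2"
```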

Re: counting the number of occurrances of a word using regex
by Enlil (Parson) on Dec 03, 2002 at 05:52 UTC
    tr does not do what you think it is doing. It does not even start up the regex engine (so you are not even using a regular expression). What it does is transliteration: it exchanges each occurrence of a character in the searchlist with the corresponding character from the replacement list (i.e. tr/SEARCHLIST/REPLACEMENTLIST/).
    You are probably looking for something more along these lines:
        use warnings;
        use strict;
        my $str = "The quick brown fox jumps over the lazy dog";
        my $count = ($str =~ s/(the)/$1/gi);
        print "$count\n";
    Though I am sure there are more elegant ways to do this. The 6 comes from:

    1. h in The
    2. e in The
    3. e in over
    (4,5,6) t,h,e in the

    which are characters occurring in the search list.
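A small sketch of how the return value of s///g behaves as a counter follows. Working on a copy and adding \b are choices made for this illustration, not part of the snippet above.

```perl
use strict;
use warnings;

my $str  = "The quick brown fox jumps over the lazy dog";
my $copy = $str;    # s/// modifies its target, so work on a copy

# s///g returns the number of substitutions made; replacing each
# match with itself ($1) leaves the text unchanged.
my $hits = ($copy =~ s/(\bthe\b)/$1/gi);

# When nothing matches, s/// returns an empty string rather than 0,
# so "|| 0" keeps the count printable as a number.
my $misses = ($copy =~ s/(\bcat\b)/$1/gi) || 0;

print "$hits $misses\n";    # prints "2 0"
```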

    -enlil

Re: counting the number of occurrances of a word using regex
by djantzen (Priest) on Dec 03, 2002 at 05:41 UTC

    Probably tr isn't what you want here, and there's no need to capture the word, so no parentheses. A simple solution is:

        my @count = $str =~ /\bthe\b/gi;
        print scalar @count, "\n";

    The idea is that g searches globally while i makes the search case-insensitive. Assigning the results to an array means that @count is populated with all of the successful matches. Evaluating the array in scalar context then returns the number of matches.
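If the count is needed in more than one place, the same approach can be wrapped in a subroutine. The helper name count_word below is hypothetical, and \Q...\E is added on the assumption that the word should be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: count whole-word, case-insensitive occurrences.
sub count_word {
    my ($text, $word) = @_;
    my @matches = $text =~ /\b\Q$word\E\b/gi;    # list context: all matches
    return scalar @matches;                       # scalar context: how many
}

my $str = "The quick brown fox jumps over the lazy dog.";
print count_word($str, "the"), "\n";    # prints 2
print count_word($str, "cat"), "\n";    # prints 0
```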