cez has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to count the frequency of an entire list of words within a string. I don't need to know how many times the individual words occur, only the general relevancy based on the word list.

split, while, what's the preferred method? Any creative ideas?

2004-12-03 Edited by Arunbear: Changed title from 'wordcounting', as per Monastery guidelines

Replies are listed 'Best First'.
Re: Counting the frequency of words in a string
by saintmike (Vicar) on Dec 02, 2004 at 21:52 UTC
    perldoc perlfaq6:
    =head2 How can I print out a word-frequency or line-frequency summary?

    To do this, you have to parse out each word in the input stream. We'll pretend that by word you mean chunk of alphabetics, hyphens, or apostrophes, rather than the non-whitespace chunk idea of a word given in the previous question:

        while (<>) {
            while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
                $seen{$1}++;
            }
        }
        while ( ($word, $count) = each %seen ) {
            print "$count $word\n";
        }

    If you wanted to do the same thing for lines, you wouldn't need a regular expression:

        while (<>) {
            $seen{$_}++;
        }
        while ( ($line, $count) = each %seen ) {
            print "$count $line";
        }

    If you want these output in a sorted order, see L<perlfaq4>: ``How do I sort a hash (optionally by value instead of key)?''.
Re: Counting the frequency of words in a string
by gaal (Parson) on Dec 02, 2004 at 21:54 UTC
    You could use split, but if your string is English you should probably use Lingua::EN::Splitter or Lingua::EN::Segmenter::TextTiling to split it into words, since these have more logic for handling apostrophes and similar characters that can legitimately appear inside a single word.

    Then use a hash to store counts.

Re: Counting the frequency of words in a string
by TedPride (Priest) on Dec 03, 2004 at 00:29 UTC
    Simplest way is to search the text for each word in the string, since this doesn't require counting all the words in the text. Depending on how many words you're looking for, you can probably also speed things up by lowercasing the text before the search:
        use strict;
        use warnings;

        my ($text, $find, $c);
        read(DATA, $text, 1024);
        $text = lc($text);
        $find = 'cried louder and louder';

        for (split / +/, $find) {
            $c++ while $text =~ /$_/g;
        }
        print $c;

        __DATA__
        Now it so happened that on one occasion the princess's golden ball did not fall into the little hand which she was holding up for it, but on to the ground beyond, and rolled straight into the water. The king's daughter followed it with her eyes, but it vanished, and the well was deep, so deep that the bottom could not be seen. At this she began to cry, and cried louder and louder, and could not be comforted. And as she thus lamented someone said to her, "What ails you, king's daughter? You weep so that even a stone would show pity."

    Just a rough draft. I'll let other people worry about preprocessing of the search string, whether to search with or without word boundaries (I prefer not), etc.
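
For anyone who does want word boundaries, here is a minimal sketch of the same per-word loop with `\b` anchors and `\Q...\E` escaping. The sub name `count_whole_words` is made up for illustration, and dropping repeated search words is an assumption about what is wanted, not part of the draft above.

```perl
use strict;
use warnings;

# Hypothetical helper: count whole-word, case-insensitive hits of each
# distinct search word in $text.
sub count_whole_words {
    my ($text, $find) = @_;

    my %uniq;
    $uniq{ lc $_ } = 1 for split ' ', $find;   # drop repeated search words

    my $total = 0;
    for my $word (keys %uniq) {
        # \Q..\E escapes regex metacharacters; \b keeps "and" from matching "hand"
        $total++ while $text =~ /\b\Q$word\E\b/gi;
    }
    return $total;
}

print count_whole_words('The hand and the band played louder and louder',
                        'and louder and'), "\n";
```

Without the `\b` anchors the same call would also count the "and" inside "hand" and "band", which is the trade-off mentioned above.
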
Re: Counting the frequency of words in a string
by Happy-the-monk (Canon) on Dec 02, 2004 at 21:58 UTC

    I think the hardest part would be to define what makes "a word".

    If a word is defined as any run of characters that doesn't match \W, this could do it:

        my ( %wordcount, $relevancy );

        for ( split /\W+/, $string )    # gets a list of all words
        {
            $wordcount{$_}++;
        }

        foreach my $word ( @list_of_words )
        {
            $relevancy += $wordcount{$word} || 0;    # listed word may not occur at all
        }

    I know there are smarter ways, for this can consume a lot of memory. Read on... =)

    Cheers, Sören
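
On the memory concern in the reply above: one possible variant keeps a lookup hash of only the listed words and walks the string once, so nothing is stored for words that are not on the list. This is a sketch under the same `\w`-based definition of a word. The sub name `list_relevancy` is hypothetical, and matching is case-sensitive as in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the occurrences of listed words without
# building a count for every word in the string.
sub list_relevancy {
    my ($string, @list_of_words) = @_;

    my %wanted;
    $wanted{$_} = 1 for @list_of_words;    # one entry per listed word

    my $relevancy = 0;
    while ( $string =~ /(\w+)/g ) {        # visit the words one at a time
        $relevancy++ if $wanted{$1};
    }
    return $relevancy;
}

print list_relevancy('the cat sat on the mat with the hat', qw(the mat)), "\n";
```
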

      how about:

      use Regexp::List, put the generated regex into $word_regex

      my $count = (my $copy = $text) =~ s/$word_regex//g;

      horrible?

        oh, i guess that could be just

        my $count = $text =~ s/$word_regex//g;

        (since destroying $text doesn't matter in my case)
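
If Regexp::List is not at hand, a plain alternation built with `join` and `quotemeta` should give the same count, and the `() =` idiom counts matches without modifying `$text` at all. This is only a sketch: the sub name `count_matches` is hypothetical, the `\b` anchors and `/i` flag are assumptions about wanting whole-word, case-insensitive matches, and it makes no attempt to optimise the pattern the way a dedicated module might.

```perl
use strict;
use warnings;

# Hypothetical helper: count every occurrence of any listed word in $text.
sub count_matches {
    my ($text, @words) = @_;
    return 0 unless @words;    # an empty alternation would match everywhere

    my $alternation = join '|', map { quotemeta($_) } @words;
    my $word_regex  = qr/\b(?:$alternation)\b/i;

    # A list-context match returns all hits; assigning that list through
    # an empty list to a scalar yields the number of hits.
    my $count = () = $text =~ /$word_regex/g;
    return $count;
}

print count_matches('She cried louder and louder.', qw(cried louder)), "\n";
```
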

Re: Counting the frequency of words in a string
by cez (Novice) on Dec 03, 2004 at 19:48 UTC
    thanks for the answers.

    some people missed the "don't need to count individual words" part, but for the sake of future reference, this could be done like so:

        my %count;
        $count{$1}++ while $text =~ /($keyword_regex)/g;

    and you could find some percentage of relevancy with:

        use POSIX qw(ceil);
        print ceil( keys(%count) / @keyword_list * 100 );
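
A worked example may make it clearer what that percentage measures: it is the share of distinct keywords that appear at least once, not the total number of hits. The sketch below wraps the two snippets in a hypothetical sub, `relevancy_percent`. It assumes the keyword list is lowercase with no duplicates, and the `\b` anchors and `/i` flag are additions for the sake of the example.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: percentage of listed keywords found in $text,
# rounded up to a whole number.
sub relevancy_percent {
    my ($text, @keyword_list) = @_;
    return 0 unless @keyword_list;    # avoid division by zero

    my $keyword_regex = join '|', map { quotemeta($_) } @keyword_list;

    my %count;
    $count{ lc $1 }++ while $text =~ /\b($keyword_regex)\b/gi;

    # keys(%count) in numeric context is the number of distinct keywords seen
    return ceil( keys(%count) / @keyword_list * 100 );
}

# "cried" and "louder" are found, "wept" is not: 2 of 3 keywords
print relevancy_percent('She cried louder and louder.', qw(cried louder wept)), "%\n";
```

Here "louder" occurs twice but still counts as a single keyword, so the result reflects coverage of the list rather than frequency.
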