=head2 How can I print out a word-frequency or line-frequency summary?
To do this, you have to parse out each word in the input stream. We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:
while (<>) {
    # misses "`sheep'"; also skips one-letter words, since the
    # pattern needs at least two characters
    while ( /(\b[^\W_\d][\w'-]+\b)/g ) {
        $seen{$1}++;
    }
}
while ( ($word, $count) = each %seen ) {
    print "$count $word\n";
}
If you wanted to do the same thing for lines, you wouldn't need a
regular expression:
while (<>) {
    $seen{$_}++;
}
while ( ($line, $count) = each %seen ) {
    print "$count $line";
}
If you want these output in a sorted order, see L<perlfaq4>: ``How do I
sort a hash (optionally by value instead of key)?''.
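As a rough sketch of what that sorted output could look like, the counts can be ordered by value, highest first. The helper name by_count_desc and the sample counts are made up for illustration; they are not part of the FAQ.

```perl
use strict;
use warnings;

# Hypothetical helper: return the keys of a count hash, highest count
# first, breaking ties alphabetically so the order is predictable.
sub by_count_desc {
    my ($seen) = @_;
    return sort { $seen->{$b} <=> $seen->{$a} || $a cmp $b } keys %$seen;
}

# Sample counts, invented for illustration.
my %seen = ( the => 3, king => 1, louder => 2, and => 3 );

for my $word ( by_count_desc( \%seen ) ) {
    printf "%5d %s\n", $seen{$word}, $word;
}
```

The same helper should work for the line-frequency hash, since it only looks at keys and counts.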
The simplest way is to search the text for each word in the search string, since this doesn't require counting every word in the text. Lowercasing the text once before the search makes the match case-insensitive, and depending on how many words you're looking for, it can probably also speed things up:
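use strict;
use warnings;
my $c = 0;                     # start at 0 so "no matches" prints 0
read(DATA, my $text, 1024);    # reads at most 1024 bytes of the sample
$text = lc $text;
my $find = 'cried louder and louder';
for my $word (split ' ', $find) {
    # \Q..\E escapes metacharacters; no \b, so "and" also matches in "hand"
    $c++ while $text =~ /\Q$word\E/g;
}
print "$c\n";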
__DATA__
Now it so happened that on one occasion the princess's golden ball
did not fall into the little hand which she was holding up for it,
but on to the ground beyond, and rolled straight into the water. The
king's daughter followed it with her eyes, but it vanished, and the
well was deep, so deep that the bottom could not be seen. At this
she began to cry, and cried louder and louder, and could not be
comforted. And as she thus lamented someone said to her, "What ails
you, king's daughter? You weep so that even a stone would show pity."
Just a rough draft. I'll let other people worry about preprocessing the search string, whether to search with or without word boundaries (I prefer without), etc.
I think the hardest part would be to define what makes "a word".
If a word is defined as any run of characters that doesn't match \W, this could do it:
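my %wordcount;
my $relevancy = 0;
# gets a list of all words; grep drops the empty leading field that
# split returns when $string starts with a non-word character
for ( grep { length } split /\W+/, $string )
{
    $wordcount{$_}++;
}
foreach my $word ( @list_of_words )
{
    # words that never occur count as zero (// needs perl 5.10 or later)
    $relevancy += $wordcount{$word} // 0;
}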
I know there are smarter ways, for this can consume a lot of memory. Read on... =)
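One way to keep memory down, sketched here under my own assumptions, is to read the text line by line and keep counts only for the words you are looking for. The helper name relevancy and the sample text are invented for the example. Unlike the loop above, a word listed twice in the search list is only counted once.

```perl
use strict;
use warnings;

# Hypothetical helper: sum the occurrences of the wanted words in the
# lines read from a filehandle, keeping counts only for those words.
sub relevancy {
    my ( $fh, @wanted ) = @_;
    my %count = map { lc($_) => 0 } @wanted;    # one slot per wanted word
    while ( my $line = <$fh> ) {
        for my $word ( split /\W+/, lc $line ) {
            $count{$word}++ if exists $count{$word};
        }
    }
    my $total = 0;
    $total += $_ for values %count;
    return $total;
}

# Demo on an in-memory filehandle; the sample text is invented.
my $sample = "She began to cry,\nand cried louder and louder.\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print relevancy( $fh, qw(cried louder and) ), "\n";
close $fh;
```

The hash never grows beyond the size of the search list, no matter how long the text is.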
Cheers, Sören
How about this: use Regexp::List to generate a regex from the word list, put it into $word_regex, and count the matches:
my $count = 0;
$count++ while $text =~ /$word_regex/g;
Horrible?
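If you would rather not pull in a module, a similar regex can be put together by hand. This is only a sketch of the idea, not what Regexp::List does internally. The helper name words_to_regex is made up, and it assumes a non-empty list of words that start and end with word characters (otherwise the \b anchors will not behave as expected).

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex matching any of the given words as
# whole words. Longer words are tried first; quotemeta escapes any regex
# metacharacters inside the words.
sub words_to_regex {
    my @words = sort { length($b) <=> length($a) } @_;
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/\b(?:$alternation)\b/i;
}

my $word_regex = words_to_regex(qw(cried louder and));
my $text = 'At this she began to cry, and cried louder and louder.';

my $count = 0;
$count++ while $text =~ /$word_regex/g;
print "$count\n";
```

Because of the \b anchors this counts whole words only, so "and" inside "hand" is not counted. Drop the anchors if you prefer substring matches.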
Thanks for the answers.
Some people missed the "don't need to count individual words" part, but for the sake of future reference, counting them could be done like so:
my %count;
$count{$1}++ while $text =~ /($keyword_regex)/g;
and you could find some percentage of relevancy with:
use POSIX qw(ceil);
# percentage of keywords found at least once; assumes @keyword_list is not empty
print ceil( scalar(keys %count) / scalar(@keyword_list) * 100 );