redhotpenguin has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I was asked by someone today how I would count the occurrences of each word within a document. I struggled to find a good solution, then in that nether-realm of reality when I was ready to give up, the following solution came upon me. How would you have solved this problem? (Assume the count is case-insensitive.)

UPDATE: To clarify I'm trying to get a count of each word in the document, not just a total word count.

#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper qw( Dumper );

my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
my @words = split( /\W+/, $string);
my %count;
foreach my $word ( @words ) {
    $count{lc($word)}++;
};
print "Word count: ", Dumper(%count);
1;

Re: Word incidence count
by thundergnat (Deacon) on May 17, 2005 at 15:06 UTC

    Something which none of the other solutions posted here so far takes into account is words with internal apostrophes (don't, won't, can't, shouldn't, you'll, it's, etc.). And most don't deal with non-ASCII characters.

    This does:

    ##########################################################
    #! /usr/bin/perl
    use warnings;
    use strict;

    my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
    my %count;
    while (<DATA>) {
        my $line = lc $_;
        while ($line =~ /($word('$word)?)/g){
            $count{$1}++;
        }
    }
    printf "%15s %5d\n", $_, $count{$_} for sort keys %count;
    __DATA__
    "Hello World!"
    "Oh poor Yorick, his world I knew well yes I did"
    "don't won't, can't shouldn't, you'll, it's, etc."
    "Señor Montóya's resüme isn't ápropos."
Re: Word incidence count
by holli (Abbot) on May 17, 2005 at 07:47 UTC
      That reply is a very good solution. I noted the use of the character class to avoid counting alphanumeric words such as this123; I did not explicitly specify in my original post how they should be handled.
Re: Word incidence count
by Skeeve (Parson) on May 17, 2005 at 07:56 UTC
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper qw( Dumper );
    my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
    # No change up to here
    # then just
    my %count;    # declaration required under "use strict"
    map ++$_,@count{split /\W+/,uc $string};
    # that's it
    print "Word count: ", Dumper(%count);
    1;
    Update: forgot the case-insensitiveness and changed the string to upper case

    $\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
Re: Word incidence count
by Joost (Canon) on May 17, 2005 at 07:44 UTC
      That leaves all the punctuation and other non-word characters in place and will distort the results (e.g. counting "word" and "word," as two different entities).


      holli, /regexed monk/
      How about this then? It splits on any non-word character and takes case into account.
      my %counts;
      while (<>) {
          $counts{ lc $_ }++ for split /\W+/;
      }
      print "$_: $counts{$_}\n" for sort keys %counts;
Re: Word incidence count
by perlsen (Chaplain) on May 17, 2005 at 07:50 UTC
    Hai just try this
    $str="Hello World!\n Oh poor Yorick, his world I knew well yes I did";
    $n=$str=~s#\w+#$&#gsi;
    print $n;
    or
    @n=$str=~m#\w+#gsi;
    print scalar(@n);
Re: Word incidence count
by ishnid (Monk) on May 17, 2005 at 10:42 UTC
    I remember having a similar problem posed as a fun challenge on another forum. The challenge was to extract word counts and print them sorted by frequency, using the fewest lines of code. Naturally, Perl won that one by a mile. This was the solution that was proposed (called by `perl script.pl < filename.txt'). It's ugly as hell but it's fun!
    print "$_ = $count{$_}\n"
        for sort { $count{$b} <=> $count{$a} }
            grep /\w/,
            map { lc $_ if (!$count{lc $_}++) }
            split( /[\W_]/ , join (' ', <>));
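    For comparison, here is a more readable sketch of the same idea: count lower-cased words, then print them sorted by frequency. It is not the challenge entry itself. The sample lines stand in for the input read from <>, the names @lines and @sorted are made up for illustration, and the alphabetical tie-break is an added assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the one-liner reads them from <> instead.
my @lines = (
    "Hello World!",
    "Oh poor Yorick, his world I knew well yes I did",
);

# Count each lower-cased word, skipping empty fields produced by split.
my %count;
for my $line (@lines) {
    $count{ lc $_ }++ for grep { length } split /[\W_]+/, $line;
}

# Most frequent first; ties are broken alphabetically (an added assumption).
my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

print "$_ = $count{$_}\n" for @sorted;
```

    Building the hash first and sorting its keys afterwards keeps the counting and the ordering separate, at the cost of a few more lines.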
Re: Word incidence count
by blazar (Canon) on May 17, 2005 at 08:17 UTC
    #!/usr/bin/env perl
    You know, there have been several discussions as to whether this really is more system-independent than, say, /usr/bin/perl is, and basically it's very much like a flame war. This is not terribly perl-specific, but sometimes it comes around...
    use strict; use warnings;
    Goood!
    use Data::Dumper qw( Dumper );
    my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
    my @words = split( /\W+/, $string);
    Well, no harm done, but split is for... ehm... splitting. Here, rather than matching on non-words to discard them, you may want to match on words to gather them:
    my @words = /\w+/g;
    <SNIP>
    print "Word count: ", Dumper(%count); 1;
    Huh?!? This is not needed, by any means. It's used in modules - for a well defined reason, not relevant here. Yours is simply a script...

    All in all, well done!

      Well, no harm done, but split is for... ehm... splitting.
      This usage of split is perfectly valid and appropriate.
      my @words = /\w+/g;
      Shouldn't that be @words =~ /\w+/g; ?

      Which of them is faster depends on the data.


      holli, /regexed monk/
        If you're talking about a plain text document:
        print `wc -w /path/to/file.txt`;
        cLive ;-)
        my @words = /\w+/g;
        Shouldn't that be @words =~ /\w+/g; ?
        No. But indeed it is implicitly assuming that the string to be matched is in $_, which is where it usually is in my code, but which is not the case for the OP's example, actually. The cure is really lightweight, however:
        @words = $string =~ /\w+/g;
        Which of them is faster depends on the data.
        Please note that I didn't speak of speed. I was talking about the conceptual terseness of the concept of saying what it is that you want as opposed to that of saying what it is that is not among the stuff that you do not want.
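        Apart from style, the two approaches can also give different results. A minimal sketch, assuming an input that starts with a non-word character (the sample string and the variable names are made up for illustration): split returns an empty leading field in that case, which the hash then counts as a "word", whereas the list-context match returns only the words.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input that starts with a non-word character.
my $string = '"Hello World!" said the world';

# split treats the leading quote as a separator, so the first field is empty.
my @from_split = split /\W+/, $string;

# A global match in list context returns only the words themselves.
my @from_match = $string =~ /\w+/g;

# Tally case-insensitively, as in the original post.
my ( %split_count, %match_count );
$split_count{ lc $_ }++ for @from_split;
$match_count{ lc $_ }++ for @from_match;

printf "split: %d keys, match: %d keys\n",
    scalar keys %split_count, scalar keys %match_count;
print "split counted an empty 'word'\n" if exists $split_count{''};
```

        With split, filtering out empty fields (for example with grep { length }) avoids the extra key.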
Re: Word incidence count
by sh1tn (Priest) on May 17, 2005 at 07:36 UTC
    ...
    my $counter;
    $counter++ while $string =~ /\w+/g;
    print "Word count: $counter\n";
    ...


Re: Word incidence count - the AWKward way
by polettix (Vicar) on May 17, 2005 at 08:41 UTC
    Just a flashback from the past...
    #!/usr/bin/awk -f
    # wordfreq.awk --- print list of word frequencies
    {
        $0 = tolower($0)    # remove case distinctions
        # remove punctuation
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }

    Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')

    Don't fool yourself.
Re: Word incidence count
by ghenry (Vicar) on May 17, 2005 at 10:19 UTC
Re: Word incidence count
by gube (Parson) on May 17, 2005 at 07:41 UTC

    Hi try this

    my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
    ($err) = $string =~ s#(\w+)#$1#g;
    print $err;

    Update: With undef $/ you can read all the text into one scalar. The substitution replaces each word with itself and returns the number of substitutions made, which is stored in the variable. That gives you the total word count of the document.
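
    A minimal sketch of that update, assuming the document is read through a filehandle. An in-memory filehandle stands in for a real file here, and the variable names are made up for illustration. It also shows a list-context match that yields the same total without modifying the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input; an in-memory filehandle stands in for a real file.
my $text = "Hello World!\nOh poor Yorick, his world I knew well yes I did\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

# Slurp mode: with $/ undefined, a single read returns the whole file.
my $document = do { local $/; <$fh> };
close $fh;

# s///g returns the number of substitutions made, so replacing every
# word with itself yields the total number of words.
my $total = ( $document =~ s/(\w+)/$1/g );

# A list-context match counts the same thing without touching the string.
my $total_by_match = () = $document =~ /\w+/g;

print "Total words: $total (match: $total_by_match)\n";
```

    Note that this gives only the total, not the per-word counts asked for in the original question.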

    Regards,

    s,,aaagzas3uzttazs444ss12b3a222aaaamkyae,s,,y,azst1-4mky,,d&&print