Word incidence count

redhotpenguin has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Word incidence count by thundergnat (Deacon) on May 17, 2005 at 15:06 UTC
Something that none of the other solutions posted here so far takes into account is words with internal apostrophes (don't, won't, can't, shouldn't, you'll, it's, etc.). And most don't deal with non-ASCII characters. This does: `########################################################## #! /usr/bin/perl use warnings; use strict; my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/; my %count; while (<DATA>) { my $line = lc $_; while ($line =~ /($word('$word)?)/g){ $count{$1}++; } } printf "%15s %5d\n", $_, $count{$_} for sort keys %count; __DATA__ "Hello World!" "Oh poor Yorick, his world I knew well yes I did" "don't won't, can't shouldn't, you'll, it's, etc." "Señor Montóya's resüme isn't ápropos."`
Re: Word incidence count by holli (Abbot) on May 17, 2005 at 07:47 UTC
see How much can this text processing be optimized?, and my reply to it. holli, /regexed monk/
Re^2: Word incidence count by redhotpenguin (Deacon) on May 17, 2005 at 07:58 UTC
That reply is a very good solution. I noted its use of a character class to avoid counting alphanumeric tokens such as this123 as words; my original post did not specify how those should be handled.
Re: Word incidence count by Skeeve (Parson) on May 17, 2005 at 07:56 UTC
`#!/usr/bin/env perl use strict; use warnings; use Data::Dumper qw( Dumper ); my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did"; # No change up to here # then just map ++$_,@count{split /\W+/,uc $string}; # that's it print "Word count: ", Dumper(%count); 1;` Update: I forgot about case-insensitivity, so the string is now converted to upper case. `$\=~s;s.;q^\|D9JYJ^^qq^\//\\\///^;ex;print`
Re: Word incidence count by Joost (Canon) on May 17, 2005 at 07:44 UTC
I'd probably do `my %counts; while (<>) { $counts{$_}++ for split; } print "$_: $counts{$_}\n" for sort keys %counts;` "What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: Word incidence count by holli (Abbot) on May 17, 2005 at 07:50 UTC
That leaves all the punctuation and other non-word characters in place, which will skew the results (e.g. counting "word" and "word," as two different entities). holli, /regexed monk/
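To make the punctuation problem concrete, here is a minimal sketch that counts the same text twice, once split on whitespace and once split on runs of non-word characters. The helper name `count_tokens` and the sample text are made up for illustration; they are not from any post in this thread.

```perl
use strict;
use warnings;

# Count the tokens produced by splitting $text on $pattern.
# Empty fields (e.g. from a leading separator) are ignored.
# Returns a hash reference of token => count.
sub count_tokens {
    my ($text, $pattern) = @_;
    my %count;
    $count{$_}++ for grep { length } split $pattern, $text;
    return \%count;
}

my $text = "word, another word";

my $by_space   = count_tokens($text, qr/\s+/);  # punctuation stays attached
my $by_nonword = count_tokens($text, qr/\W+/);  # punctuation is discarded

print "whitespace split:\n";
print "  '$_' => $by_space->{$_}\n" for sort keys %$by_space;
print "non-word split:\n";
print "  '$_' => $by_nonword->{$_}\n" for sort keys %$by_nonword;
```

With the whitespace split, `word` and `word,` are separate keys with a count of 1 each; with the `\W+` split there is a single `word` key with a count of 2.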
Re^3: Word incidence count by Joost (Canon) on May 17, 2005 at 07:53 UTC
True, and it also doesn't account for capitalisation. ~~It does give the same results as the OP's code.~~ - update: I just noticed it didn't :-) "What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: Word incidence count by mrborisguy (Hermit) on May 17, 2005 at 13:31 UTC
How about this then? It splits on any run of non-word characters and accounts for case. `my %counts; while (<>) { $counts{ lc $_ }++ for split /\W+/; } print "$_: $counts{$_}\n" for sort keys %counts;`
Re: Word incidence count by perlsen (Chaplain) on May 17, 2005 at 07:50 UTC
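Hi, just try this: `$str="Hello World!\n Oh poor Yorick, his world I knew well yes I did"; $n=$str=~s#\w+#$&#gsi; print $n;` or `@n=$str=~m#\w+#gsi; print scalar(@n);` (Note the double quotes: in a single-quoted string the `\n` stays a literal backslash followed by `n`, and that `n` would be counted as a word.)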
Re: Word incidence count by ishnid (Monk) on May 17, 2005 at 10:42 UTC
I remember having a similar problem posed as a fun challenge on another forum. The challenge was to extract word counts and print them sorted by frequency, using the fewest lines of code. Naturally, Perl won that one by a mile. This was the solution that was proposed (called by `perl script.pl < filename.txt`). It's ugly as hell but it's fun! `print "$_ = $count{$_}\n" for sort { $count{$b} <=> $count{$a} } grep /\w/, map { lc $_ if (!$count{lc $_}++) } split( /[\W_]/ , join (' ', <>));`
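For comparison, the same idea can be spelled out in a more readable form. This is only a sketch, not the challenge entry: the sub name `word_frequencies` is hypothetical, it works on a string instead of reading STDIN, and the alphabetical tie-break is an addition to make the order deterministic.

```perl
use strict;
use warnings;

# Return a list of [word, count] pairs, most frequent first.
# Words are lower-cased; underscores and non-word characters separate words.
# Ties are broken alphabetically (an assumption, not part of the one-liner).
sub word_frequencies {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for grep { /\w/ } split /[\W_]+/, $text;
    return map  { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} or $a cmp $b }
           keys %count;
}

my $sample = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
printf "%s = %d\n", @$_ for word_frequencies($sample);
```

For the sample string from this thread, `i` and `world` come first with a count of 2 each, followed by the remaining words with a count of 1.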
Re: Word incidence count by blazar (Canon) on May 17, 2005 at 08:17 UTC
`#!/usr/bin/env perl` You know, there have been several discussions as to whether this really is more system-independent than, say, `/usr/bin/perl`, and basically it's very much like a flame war. This is not terribly Perl-specific, but it comes around now and then... `use strict; use warnings;` Good! `use Data::Dumper qw( Dumper ); my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did"; my @words = split( /\W+/, $string);` Well, no harm done, but split is for... ehm... splitting. Here, rather than matching on non-words to discard them, you may want to match on words to gather them: `my @words = /\w+/g;` <SNIP> `print "Word count: ", Dumper(%count); 1;` Huh?!? The trailing `1;` is not needed, by any means. It's used in modules, for a well-defined reason that is not relevant here. Yours is simply a script... All in all, well done!
Re^2: Word incidence count by holli (Abbot) on May 17, 2005 at 08:27 UTC
"Well, no harm done, but split is for... ehm... splitting." This usage of split is perfectly valid and appropriate. "`my @words = /\w+/g;`" Shouldn't that be `@words =~ /\w+/g;`? Which of them is faster depends on the data. holli, /regexed monk/
Re^3: Word incidence count by cLive ;-) (Prior) on May 17, 2005 at 08:39 UTC
If you're talking about a plain text document: `` print `wc -w /path/to/file.txt`; `` cLive ;-)
Re^3: Word incidence count by blazar (Canon) on May 17, 2005 at 08:39 UTC
`my @words = /\w+/g;` "Shouldn't that be `@words =~ /\w+/g;`?" No. But indeed it implicitly assumes that the string to be matched is in `$_`, which is where it usually is in my code, but which is not the case in the OP's example. The cure is really lightweight, however: `@words = $string =~ /\w+/g;` "Which of them is faster depends on the data." Please note that I didn't speak of speed. I was talking about the concept: saying what it is that you want is terser than saying what it is that you do not want.
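A small sketch of the two approaches side by side, using the sample string from this thread. The variable names `@matched`, `@fields` and `@leading` are illustrative only. The last two lines show one practical difference that goes beyond taste: when the text begins with a separator, `split` returns a leading empty field, while the match returns only words.

```perl
use strict;
use warnings;

my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";

# Gather: a /g match in list context returns every matched word.
my @matched = $string =~ /\w+/g;

# Discard: split throws away the separators and returns what lies between.
my @fields = split /\W+/, $string;

# Case-insensitive incidence count built from the gathered words.
my %count;
$count{ lc $_ }++ for @matched;

print scalar(@matched), " words matched, ", scalar(@fields), " fields from split\n";
print "$_: $count{$_}\n" for sort keys %count;

# When the text starts with a non-word character, split yields an
# empty first field; the match does not.
my @leading = split /\W+/, '"Hello" world';
print "first field from split is empty\n" if $leading[0] eq '';
```

For this particular string both approaches return the same 13 words, so the choice between them here is mostly about which reads more clearly.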
Re^4: Word incidence count by holli (Abbot) on May 17, 2005 at 08:42 UTC
Re: Word incidence count by sh1tn (Priest) on May 17, 2005 at 07:36 UTC
`... my $counter; $counter++ while $string =~ /\w+/g; print "Word count: $counter\n"; ...`
Re: Word incidence count - the AWKward way by polettix (Vicar) on May 17, 2005 at 08:41 UTC
Just a flashback from the past... `#!/usr/bin/awk -f # wordfreq.awk --- print list of word frequencies { $0 = tolower($0) # remove case distinctions # remove punctuation gsub(/[^[:alnum:]_[:blank:]]/, "", $0) for (i = 1; i <= NF; i++) freq[$i]++ } END { for (word in freq) printf "%s\t%d\n", word, freq[word] }` Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))') Don't fool yourself.
Re: Word incidence count by ghenry (Vicar) on May 17, 2005 at 10:19 UTC
I did this in Wordcounting with DBM files - can't open file? HTH. Walking the road to enlightenment... I found a penguin and a camel on the way..... Fancy a yourname@perl.me.uk? Just ask!!!
Re: Word incidence count by gube (Parson) on May 17, 2005 at 07:41 UTC
Hi, try this: `my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did"; ($err) = $string =~ s#(\w+)#$1#g; print $err;` Update: With `undef $/` you can read the whole text into one scalar. The substitution replaces each word with itself, and its return value, the number of substitutions made, is stored in the variable. That gives you the total word count of the document. Regards, `s,,aaagzas3uzttazs444ss12b3a222aaaamkyae,s,,y,azst1-4mky,,d&&print`
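The slurp-and-substitute approach described in the update could look roughly like this. It is a sketch under a few assumptions: the sub name `count_words_in_fh` is hypothetical, `local $/` is used in place of `undef $/` so the setting is restored afterwards, and an in-memory filehandle stands in for a real document.

```perl
use strict;
use warnings;

# Count the words readable from a filehandle by slurping it whole.
sub count_words_in_fh {
    my ($fh) = @_;
    local $/;                       # slurp mode; $/ is restored on return
    my $text = <$fh>;
    return 0 unless defined $text;  # nothing to read
    # Replace each word with itself. s///g returns the number of
    # substitutions made, or an empty string when there were none.
    my $n = ( $text =~ s/(\w+)/$1/g );
    return $n || 0;
}

# An in-memory filehandle stands in for a real file here.
my $doc = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
open my $fh, '<', \$doc or die "Cannot open in-memory file: $!";
print "Word count: ", count_words_in_fh($fh), "\n";
close $fh;
```

This gives the total number of words (13 for the sample string), not the per-word incidence that the other replies compute.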