Re: Word incidence count
by thundergnat (Deacon) on May 17, 2005 at 15:06 UTC
|
Something that none of the other solutions posted here so far takes into account is words with internal apostrophes (don't, won't, can't, shouldn't, you'll, it's, etc.). Most also don't deal with non-ASCII characters.
This does:
##########################################################
#! /usr/bin/perl
use warnings;
use strict;
my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
my %count;
while (<DATA>) {
my $line = lc $_;
while ($line =~ /($word('$word)?)/g){
$count{$1}++;
}
}
printf "%15s %5d\n", $_, $count{$_} for sort keys %count;
__DATA__
"Hello World!"
"Oh poor Yorick, his world I knew well yes I did"
"don't won't, can't shouldn't, you'll, it's, etc."
"Señor Montóya's resüme isn't ápropos."
| [reply] [d/l] |
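For anyone trying this on a current Perl with UTF-8 input, here is a minimal sketch of the same idea with the encoding made explicit. It assumes the source file and the terminal are UTF-8; `count_words` is a hypothetical helper name, not something from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the literal below contains UTF-8 characters
binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output

# count_words is a hypothetical helper, not part of the original post.
sub count_words {
    my ($text) = @_;
    # Same pattern as in the post: a run of alphanumerics that is not
    # part of a longer run.
    my $word = qr/(?<!\p{Alnum})\p{Alnum}+(?!\p{Alnum})/;
    my %count;
    my $lower = lc $text;              # keep it in a variable so //g can track its position
    # A word, optionally followed by an apostrophe and another word.
    $count{$1}++ while $lower =~ /($word(?:'$word)?)/g;
    return \%count;
}

my $count = count_words("Señor Montóya's resüme isn't ápropos. Isn't it?");
printf "%15s %5d\n", $_, $count->{$_} for sort keys %$count;
```

The lower-cased copy is stored in `$lower` on purpose: matching with `//g` directly against `lc($text)` would start from the beginning on every iteration.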
Re: Word incidence count
by holli (Abbot) on May 17, 2005 at 07:47 UTC
|
| [reply] |
|
|
That reply is a very good solution. I noted the use of the character class to avoid counting alphanumeric words such as this123; my original post did not explicitly specify how those should be handled.
| [reply] |
Re: Word incidence count
by Skeeve (Parson) on May 17, 2005 at 07:56 UTC
|
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper qw( Dumper );
my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
# No change up to here
# then just
map ++$_,@count{split /\W+/,uc $string};
# that's it
print "Word count: ", Dumper(%count);
1;
Update: I forgot about case-insensitivity, so the string is now converted to upper case.
$\=~s;s*.*;q^|D9JYJ^^qq^\//\\\///^;ex;print
| [reply] [d/l] [select] |
Re: Word incidence count
by Joost (Canon) on May 17, 2005 at 07:44 UTC
|
my %counts;
while (<>) {
$counts{$_}++ for split;
}
print "$_: $counts{$_}\n" for sort keys %counts;
| [reply] [d/l] |
|
|
That leaves all the punctuation and other non-word characters in place, which will skew the results (e.g. counting "word" and "word," as two different entities).
| [reply] |
|
|
True, and it also doesn't account for capitalisation. It does give the same results as the OP's code. Update: I just noticed it doesn't :-)
| [reply] |
|
|
How about this, then? It splits on any non-word character and takes case into account.
my %counts;
while (<>) {
$counts{ lc $_ }++ for grep { length } split /\W+/;   # skip the empty field split yields when a line starts with punctuation
}
print "$_: $counts{$_}\n" for sort keys %counts;
| [reply] [d/l] |
Re: Word incidence count
by perlsen (Chaplain) on May 17, 2005 at 07:50 UTC
|
# double quotes, so that \n is a newline and not a literal "n" that \w+ would count as a word
$str="Hello World!\n Oh poor Yorick, his world I knew well yes I did";
$n=$str=~s#\w+#$&#gsi;
print $n;
or
@n=$str=~m#\w+#gsi;
print scalar(@n);
| [reply] [d/l] |
Re: Word incidence count
by ishnid (Monk) on May 17, 2005 at 10:42 UTC
|
I remember having a similar problem posed as a fun challenge on another forum. The challenge was to extract word counts and print them sorted by frequency, using the fewest lines of code. Naturally, Perl won that one by a mile. This was the solution that was proposed (called by `perl script.pl < filename.txt'). It's ugly as hell but it's fun!
print "$_ = $count{$_}\n" for sort { $count{$b} <=> $count{$a} } grep /\w/, map { lc $_ if (!$count{lc $_}++) } split( /[\W_]/ , join (' ', <>));
| [reply] [d/l] |
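For comparison, a more readable sketch of the same frequency-sorted count. It is not the challenge entry: it takes a string instead of reading from `<>`, breaks ties alphabetically (an added assumption), and `by_frequency` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return [word, count] pairs, most frequent first.
sub by_frequency {
    my ($text) = @_;
    my %count;
    # [^\W_]+ matches runs of word characters excluding the underscore,
    # which mirrors splitting on /[\W_]/ in the one-liner.
    $count{ lc $_ }++ for $text =~ /[^\W_]+/g;
    return map  { [ $_, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
}

my $text = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
printf "%s = %d\n", @$_ for by_frequency($text);
```

Counting first and sorting afterwards avoids the trick of populating `%count` as a side effect inside `map`.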
Re: Word incidence count
by blazar (Canon) on May 17, 2005 at 08:17 UTC
|
#!/usr/bin/env perl
You know, there have been several discussions as to whether this really is more system-independent than, say, /usr/bin/perl, and they basically turn into flame wars. This is not terribly Perl-specific, but it comes around sometimes...
use strict;
use warnings;
Goood!
use Data::Dumper qw( Dumper );
my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
my @words = split( /\W+/, $string);
Well, no harm done, but split is for... ehm... splitting. Here, rather than matching on non-words to discard them, you may want to match on words to gather them:
my @words = /\w+/g;
<SNIP>
print "Word count: ", Dumper(%count);
1;
Huh?!? This is not needed, by any means. It's used in modules - for a well defined reason, not relevant here. Yours is simply a script...
All in all, well done!
| [reply] [d/l] [select] |
|
|
Well, no harm done, but split is for... ehm... splitting.
This usage of split is perfectly valid and appropriate.
my @words = /\w+/g;
Shouldn't that be
@words =~ /\w+/g;
?
Which of them is faster depends on the data.
| [reply] [d/l] [select] |
|
|
If you're talking about a plain text document:
print `wc -w /path/to/file.txt`;
cLive ;-) | [reply] [d/l] |
|
|
my @words = /\w+/g;
Shouldn't that be @words =~ /\w+/g; ?
No. But it does implicitly assume that the string to be matched is in $_, which is where it usually is in my code but not in the OP's example. The cure is lightweight, though:
@words = $string =~ /\w+/g;
Which of them is faster depends on the data.
Please note that I didn't speak of speed. I was talking about conceptual clarity: saying what it is that you want, as opposed to saying what it is that you do not want.
| [reply] [d/l] [select] |
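To make the split-versus-match discussion concrete, here is a small sketch that runs both on the OP's string. The helper names `words_by_split` and `words_by_match` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two approaches discussed above.
sub words_by_split { my ($s) = @_; return split /\W+/, $s }   # discard the non-words
sub words_by_match { my ($s) = @_; return $s =~ /\w+/g }      # gather the words

my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
my @split = words_by_split($string);
my @match = words_by_match($string);
print "split: @split\n";
print "match: @match\n";

# A string that starts with a non-word character shows a difference:
# split keeps an empty leading field, the match does not.
my @quoted = words_by_split('"Hello World!"');
print scalar(@quoted), " fields from split, first is '$quoted[0]'\n";
```

Both give the same list for the OP's string; they only diverge when the input begins with a separator, which is one practical argument for matching on what you want.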
|
|
Re: Word incidence count
by sh1tn (Priest) on May 17, 2005 at 07:36 UTC
|
...
my $counter = 0;   # start at 0 so an input without words does not print undef
$counter++ while $string =~ /\w+/g;
print "Word count: $counter\n";
...
| [reply] [d/l] |
Re: Word incidence count - the AWKward way
by polettix (Vicar) on May 17, 2005 at 08:41 UTC
|
#!/usr/bin/awk -f
# wordfreq.awk --- print list of word frequencies
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
Flavio (perl -e 'print(scalar(reverse("\nti.xittelop\@oivalf")))')
Don't fool yourself.
| [reply] [d/l] |
Re: Word incidence count
by ghenry (Vicar) on May 17, 2005 at 10:19 UTC
|
| [reply] |
Re: Word incidence count
by gube (Parson) on May 17, 2005 at 07:41 UTC
|
my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
($err) = $string =~ s#(\w+)#$1#g;
print $err;
Update: With undef $/, you can read the whole text into one scalar. The substitution replaces each word with itself and returns the number of substitutions made, which is stored in the variable, so you get the total word count of the document.
Regards,
s,,aaagzas3uzttazs444ss12b3a222aaaamkyae,s,,y,azst1-4mky,,d&&print | [reply] [d/l] [select] |
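If only the total is needed, the same number can be obtained without running a substitution over the string. This is a minimal sketch of the list-assignment counting idiom; `word_total` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: count words without modifying the string.
sub word_total {
    my ($text) = @_;
    # Assigning the matches to an empty list and reading that assignment
    # in scalar context yields the number of matches.
    my $n = () = $text =~ /\w+/g;
    return $n;
}

my $string = "Hello World!\n Oh poor Yorick, his world I knew well yes I did";
print "Word count: ", word_total($string), "\n";
```

To count a whole file this way, the text would first have to be slurped into one scalar, for example by localizing `$/` as the update above describes.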