Re: How to find the most frequent in a file?

Re: Re: How to find the most frequent in a file? by Anonymous Monk on Jan 29, 2003 at 01:13 UTC
```perl
#! /usr/local/bin/perl -w
use strict;

my $file = $ARGV[0];
open TEXT, "<$file";
my %frequency = ();
while ( my $line = <TEXT> ) {
    my (@words) = split /\W/, $line;
    foreach my $word (@words) {
        $frequency{$word} = $frequency{$word} + 1;
    }
}
foreach my $word ( keys %frequency ) {
    my $count = $frequency{$word};
    print "$word\t\t$count\n";
}
```

This is what I have written, but I don't know how to make it find the most frequent word.

Added code tags - dvergin 2003-01-28
Re: Re: Re: How to find the most frequent in a file? by pfaut (Priest) on Jan 29, 2003 at 01:37 UTC
Three comments on what you have already (which is 90% of the way there).

In your split, you can avoid getting blank matches by using `\W+` instead of `\W`. This matches a whole run of non-word characters as one separator instead of matching each one separately, which keeps your code from getting an empty string between consecutive non-word characters in the input text.

When you increment the entry in the hash, use `$frequency{$word}++;` instead of `$frequency{$word} = $frequency{$word} + 1 ;`. The latter gives warnings about using uninitialized values the first time any word is seen.

Since you are going through the entire hash and collecting both keys and values, you could use `each` instead of `keys` to get them both at once.

To finish this off, all you need to do is keep track of the highest count seen while in the last loop.

```perl
#! /usr/local/bin/perl -w
use strict;

my $file = $ARGV[0];
open TEXT, "<$file" or die "Cannot open $file: $!\n";

# Count how often each word occurs.
my %frequency = ();
while ( my $line = <TEXT> ) {
    my (@words) = split /\W+/, $line;
    foreach my $word (@words) {
        $frequency{$word}++;
    }
}

# Track the highest count seen and every word that reaches it.
my @most;
my $cnt;
while ( my ( $word, $freq ) = each %frequency ) {
    if ( !@most ) {
        push @most, $word;
        $cnt = $freq;
        next;
    }
    next if $cnt > $freq;
    if ( $cnt == $freq ) {
        push @most, $word;
    }
    else {
        @most = ($word);
        $cnt  = $freq;
    }
}

if ( @most == 0 ) {
    print "No words in $file\n";
}
elsif ( @most == 1 ) {
    print "'$most[0]' occurred $cnt times\n";
}
else {
    print "The following words each appeared $cnt times\n@most\n";
}
```

An alternative would be to sort the words by their counts, for example with a Schwartzian Transform.

---
`print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';`
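For illustration, the sort-based alternative might look roughly like the sketch below. It is not pfaut's code: the sample counts, `@ranked`, and `$top` are made up here, and the hash stands in for the `%frequency` built by the script. Each word is paired with its count, the pairs are sorted by count (highest first, ties broken alphabetically), and the counts are stripped off again.

```perl
use strict;
use warnings;

# Hypothetical sample counts standing in for %frequency from the script.
my %frequency = ( the => 5, cat => 2, hat => 5, sat => 1 );

# Schwartzian Transform: decorate each word with its count, sort on the
# count (descending, then alphabetically), and undecorate again.
my @ranked =
    map  { $_->[0] }
    sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
    map  { [ $_, $frequency{$_} ] }
    keys %frequency;

# The words tied for the highest count are at the front of @ranked.
my $top  = @ranked ? $frequency{ $ranked[0] } : 0;
my @most = grep { $frequency{$_} == $top } @ranked;

print "Most frequent ($top times): @most\n";
```

Because the sort key here is only a hash lookup, sorting the keys directly with `$frequency{$b} <=> $frequency{$a}` would do the same job; the transform pays off mainly when the key is expensive to compute. Sorting is also more work than the single pass with `each`, but it gives you the full ranking rather than just the top words.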
Re: Re: Re: Re: How to find the most frequent in a file? by Bilbo (Pilgrim) on Jan 29, 2003 at 09:49 UTC
This doesn't quite work if lines start with non-word characters such as brackets or leading spaces (as I discovered when I tried running this program on itself). If the first character(s) of the line are non-word characters, then the first value returned by split is an empty string, and the program may tell you that the most common word is the null string (''). A quick fix is to replace your `$frequency{$word}++;` line with `$frequency{$word}++ if ($word);` so the null string is ignored. Note that this test also skips the word `0`, which is false in Perl; `$frequency{$word}++ if length $word;` ignores only the empty string.
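A small sketch of the behaviour described above, using a made-up input line (the variable names are illustrative only). It shows the empty leading field that `split` produces, and how a truth test and a `length` test differ when filtering it out.

```perl
use strict;
use warnings;

# Hypothetical input line that starts with a non-word character.
my $line = "(foo) 0 bar";

# The leading "(" is a separator, so split returns an empty first field.
my @fields = split /\W+/, $line;

# A truth test drops the empty string but also the word "0".
my @truthy = grep { $_ } @fields;

# A length test drops only the empty string.
my @nonempty = grep { length } @fields;

print scalar(@fields), " fields, first is '$fields[0]'\n";
print "truth test:  @truthy\n";
print "length test: @nonempty\n";
```

Only leading empty fields show up this way; `split` discards trailing empty fields by default, so a line that ends in punctuation does not cause the same problem.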