perl_seeker has asked for the wisdom of the Perl Monks concerning the following question:
This bit of code, which I got from the Perl Cookbook, uses a hash to count the number of times each word occurs.

    %count = ();
    foreach $element (@words) {
        $count{$element}++;
    }
    while ( ($k, $v) = each %count ) {
        print "$k => $v\n";
    }

Printing the hash gives me the words and their frequency counts, but how do we sort this list to have the most frequent word first, like this:

    Most frequently occurring word    no. of times occurred
    second most frequent word         no. of times occurred
    ....                              ....

    the    150
    it     85
    we     60
    are    40

Also, since I need to use a very large file, is it possible to do the whole exercise using a hash: split text ...

Any sample code using a hash would be most appreciated.

    sub lexicon_generate {
        open CP, 'tcorpus.txt' or die $!;    # Open file.
        my @words;
        while (<CP>) {
            chomp;
            push @words, split;
        }
        close CP;
        #print "\n@words\n";

        $lwords = @words;
        #print "\n$lwords";
        for ($i = 0; $i < $lwords; $i++) {
            #print "\nThis is the next token:";
            #print "\n$words[$i]";
        }

        # Remove punctuation marks.
        foreach my $item (@words) {
            $item =~ tr/*//d;
            $item =~ tr/(//d;
            $item =~ tr/)//d;
            $item =~ tr/""//d;
            $item =~ tr/''//d;
            $item =~ tr/?//d;
            $item =~ tr/,//d;
            $item =~ tr/. //d;
            $item =~ tr/-//d;
            $item =~ tr/"//d;
            $item =~ tr/'//d;
            $item =~ tr/!//d;
            $item =~ tr/;//d;
            $item = '' unless defined $item;
            #print "\nThe token after removing punctuation marks:";
            #print "\n$item\n";
        }

        # Number of words in @words before removing duplicates.
        $lnwords = @words;
        #print "\n$lnwords";
        foreach my $final_thing (@words) {
            #print "$final_thing\n";
        }

        # Remove duplicate strings.
        my %seen = ();
        my @uniq = ();
        foreach my $u_thing (@words) {
            unless ($seen{$u_thing}) {
                # If we get here, we have not seen it before.
                $seen{$u_thing} = 1;
                push(@uniq, $u_thing);
            }
        }
        #print "\nThe unique list:";
        #print "\n@uniq";

        # Number of words in @words after removing duplicates.
        $luniq = @uniq;
        #print "\n$luniq";

        open LEX, '>tcorpus_unique.txt' or die $!;
        foreach my $u_elt (@uniq) {
            #print "\n$u_elt";
            print LEX "\n$u_elt";
        }
        close LEX;
    }

    &lexicon_generate();
2005-03-18 Janitored by Arunbear - added readmore tags, as per Monastery guidelines
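
The two things asked for above (a frequency-sorted listing, and doing the whole job with a hash on a large file) could be sketched roughly as follows. This is only one possible approach, not taken from any reply in this thread. The sub names `count_words` and `by_frequency` and the sample text are made up for illustration, and the punctuation set simply mirrors the `tr///` calls in the posted code. Words are counted as each line is read, so no array of all the words is ever built; only the hash of distinct words is held in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count word frequencies from a filehandle, one line at a time, so the
# file's words never have to be collected into one big array first.
sub count_words {
    my ($fh) = @_;
    my %count;
    while ( my $line = <$fh> ) {
        for my $word ( split ' ', $line ) {
            # Delete the same punctuation characters the posted code removes.
            # A trailing '-' inside tr/// is a literal hyphen, not a range.
            $word =~ tr/*()"'?,.!;-//d;
            next if $word eq '';    # token was punctuation only
            $count{$word}++;
        }
    }
    return \%count;
}

# Return the words ordered by descending count; ties are broken
# alphabetically so the output order is stable between runs.
sub by_frequency {
    my ($count) = @_;
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

# Demo on an in-memory sample (hypothetical text, not from tcorpus.txt).
my $sample = "the cat saw the dog.\nthe dog saw the cat, and the bird!\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";
my $count = count_words($fh);
close $fh;

# Most frequent word first, followed by its count.
printf "%-10s %d\n", $_, $count->{$_} for by_frequency($count);
```

To run it against the real corpus, the in-memory open would be replaced by something like `open my $fh, '<', 'tcorpus.txt' or die $!;`. The keys of the returned hash are already the unique word list, so a separate `%seen` pass is not needed. Note that this sketch treats "The" and "the" as different words; applying `lc` to each word before counting would merge them, if that is what is wanted.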
Replies are listed 'Best First'.
Re: Frequency of words in text file and hashes
by Zaxo (Archbishop) on Mar 18, 2005 at 11:11 UTC

Re: Frequency of words in text file and hashes
by rev_1318 (Chaplain) on Mar 18, 2005 at 11:44 UTC

Re: Frequency of words in text file and hashes
by thinker (Parson) on Mar 18, 2005 at 11:10 UTC
    by perl_seeker (Scribe) on Mar 24, 2005 at 09:36 UTC

Re: Frequency of words in text file and hashes
by deibyz (Hermit) on Mar 18, 2005 at 11:25 UTC

Re: Frequency of words in text file and hashes
by TedPride (Priest) on Mar 18, 2005 at 17:37 UTC
    by perl_seeker (Scribe) on Mar 28, 2005 at 11:36 UTC