perl_seeker has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,
I have a large plain text file; it currently contains about one lakh (100,000) words, and may contain up to one million or so words in future.

I also have a simple script which splits the text on whitespace into an array, removes punctuation symbols,
and counts the total number of words. It then counts the number of unique words after removing duplicate words.
I also need to list the words top down, in terms of their frequency of occurrence in the original text (before removing duplicates):

most frequently occurring word        no. of times it occurred
second most frequent word             no. of times it occurred
....                                  ....
This bit of code, which I got from the Perl Cookbook, uses a hash to count the number of times each word occurs.

%count = ();
foreach $element (@words) {
    $count{$element}++;
}
while ( ($k, $v) = each %count ) {
    print "$k => $v\n";
}
Printing the hash gives me the words and their frequency counts, but how do we sort this list to have the most
frequently occurring one at the top, then the second most frequent, and so on? E.g.:

the 150
it 85
we 60
are 40
Also, since I need to use a very large file, is it possible to do the whole exercise using a hash: split the text
into a hash, count the total number of words, remove duplicates, count the number of unique words, and also do the frequency count?

Here is the code I have reading the words into an array:
sub lexicon_generate {
    open CP, 'tcorpus.txt' or die $!;   # Open file.
    my @words;
    while (<CP>) {
        chomp;
        push @words, split;
    }
    close CP;

    $lwords = @words;   # Number of tokens read.

    # Remove punctuation marks.
    foreach my $item (@words) {
        $item =~ tr/*//d;
        $item =~ tr/(//d;
        $item =~ tr/)//d;
        $item =~ tr/""//d;
        $item =~ tr/''//d;
        $item =~ tr/?//d;
        $item =~ tr/,//d;
        $item =~ tr/. //d;
        $item =~ tr/-//d;
        $item =~ tr/"//d;
        $item =~ tr/'//d;
        $item =~ tr/!//d;
        $item =~ tr/;//d;
        $item = '' unless defined $item;
    }

    # Number of words in @words before removing duplicates.
    $lnwords = @words;

    # Remove duplicate strings.
    my %seen = ();
    my @uniq = ();
    foreach my $u_thing (@words) {
        unless ($seen{$u_thing}) {
            # If we get here, we have not seen it before.
            $seen{$u_thing} = 1;
            push (@uniq, $u_thing);
        }
    }

    # Number of words in @words after removing duplicates.
    $luniq = @uniq;

    # Write the unique word list to a file.
    open LEX, '>tcorpus_unique.txt' or die $!;
    foreach my $u_elt (@uniq) {
        print LEX "\n$u_elt";
    }
    close LEX;
}
&lexicon_generate();
Any sample code using a hash would be most appreciated.
Thanx,
perl_seeker:)


Re: Frequency of words in text file and hashes
by Zaxo (Archbishop) on Mar 18, 2005 at 11:11 UTC

    You're doing the counts just right (but you may want to translate punctuation before splitting). You only need to manipulate the hash a bit to get what you want. Taking keys of the hash will give you the list of distinct words, which can be sorted by their value in %count:

    my @wordlist = sort { $count{$b} <=> $count{$a} } keys %count;
    print 'The text contains ', scalar(@wordlist), ' distinct words', "\n";
    print "$_\t$count{$_}\n" for @wordlist;
    You can translate punctuation in just one statement by putting all the punctuation characters in the first slot and using the /d switch.
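    For instance, a one-statement version covering the same characters as your loop might look like this (a sketch; note the hyphen goes last so tr/// does not read it as a range):

        # Delete all the listed punctuation characters in one pass.
        $item =~ tr/*()"'?,.!;-//d;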

    After Compline,
    Zaxo

Re: Frequency of words in text file and hashes
by rev_1318 (Chaplain) on Mar 18, 2005 at 11:44 UTC
    As others have already explained, sort is your friend. You could, however, improve your code by:
    • parsing the file line by line, instead of slurping the entire file and splitting it into an array
    • defining what a 'word' is: does case matter? are punctuation/numbers allowed?

    Assuming that case doesn't matter and words are defined by \w+, this code should work:

    my %count;
    while ( <FILE> ) {
        $count{lc $_}++ for /\w+/g;
    }
    print "$_ => $count{$_}\n"
        for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    (notice the secondary comparison, which orders equally frequent words alphabetically)
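    A self-contained version of the same idea might look like this (a sketch; the lexical filehandle and the tcorpus.txt filename from the original post are assumptions):

        use strict;
        use warnings;

        open my $fh, '<', 'tcorpus.txt' or die $!;   # filename taken from the OP
        my %count;
        while ( <$fh> ) {
            $count{lc $_}++ for /\w+/g;              # count each \w+ token, case-folded
        }
        close $fh;

        print "$_ => $count{$_}\n"
            for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;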

    HTH,
    Paul

Re: Frequency of words in text file and hashes
by thinker (Parson) on Mar 18, 2005 at 11:10 UTC

    Hi perl_seeker

    This should do what you want

    for (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$_ $count{$_}\n";
    }

    thinker

      Hello!
      My apologies to everyone for the delayed response.

      Thanks for the code, thinker, it works fine.
      perl_seeker:)
Re: Frequency of words in text file and hashes
by deibyz (Hermit) on Mar 18, 2005 at 11:25 UTC
    Once you have the words split into an array, you can grab all the information from the hash. I.e.:
    my $total_words = @words;              # All-word count
    my %count;
    $count{$_}++ for @words;               # Here are the counts
    my $uniq_words = scalar keys %count;   # No. of unique words

    # Print sorted by frequency
    print "$_\t$count{$_}\n"
        for sort { $count{$b} <=> $count{$a} } keys %count;
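    And if you'd rather not keep @words around at all, the all-word total can be recovered from the hash itself, since it is just the sum of the counts (a small sketch):

        # Total words = sum of all the frequency counts in %count.
        my $total_words = 0;
        $total_words += $_ for values %count;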

    Hope it helps,
    deibyz

Re: Frequency of words in text file and hashes
by TedPride (Priest) on Mar 18, 2005 at 17:37 UTC
    The only punctuation marks you really have to worry about are - and '. If - occurs at the end of the line, the word closest to the end of the line has been split between that line and the next line, and has to be wrapped over. If ' occurs inside a word, it's part of the word and has to be preserved; otherwise, it is removed. All other punctuation marks are removed. A sloppy version of this is as follows:

    EDIT: I wasn't accounting for extra spaces on the start, end or middle of each line. Updated code and added comments.

    EDIT: Also updated so comments fit properly.

    use strict;
    use warnings;

    my (%count, $last);
    my $max = 0;

    while (<DATA>) {
        s/^ +//; s/\s+$//; s/ +/ /g;            ## Remove extra spaces
        $_ = lc($_);                            ## Lowercase so not sensitive
        $_ = $last . $_;                        ## Append word piece
        if (m/-$/) {                            ## If line ends in -, remove last
            ($_, $last) = m/(.*?) ?(\w+)-/;     ## word piece for appending
        }
        else {
            $last = '';                         ## Else word piece is nothing
        }
        s/[^\w' ]//g;                           ## Remove extra punctuation
        s/(\w)'(\w)/$1-$2/g;                    ## Convert ' inside words to -
        s/'//g;                                 ## Remove all remaining '
        s/-/'/g;                                ## Convert - back to '
        for (split / +/) {                      ## Split on space and
            $count{$_}++;                       ## process words
            $max = length() if length() > $max; ## Find longest word size
        }
    }

    print sprintf('%'.$max.'s', $_) . " => $count{$_}\n"
        for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

    __DATA__
    I can't be bought, and I won't be bought! My school-
    house is my own, my precious. One school to rule them all!
    This is my cant, my creed. Funky space check!
      Hello Ted,
      thanks for the code and your ideas. I am actually working with text in a font for another language
      (not English), so the tokenisation and translation code does not work for me, but of course works
      great with your test data in English. But anyway, got the idea.

      In this font, a single letter (vowel/consonant) may be mapped to two or more ASCII characters, e.g.:
      letter 1 in my font = ascii chars sd
      letter 2 in my font = ascii chars !#
      Though of course we can do the frequency count of words from %count, and also find the number of unique words from it, we are still building %count from an array into which the words have been pushed.
      Right now this works, but if the array were to hold a huge number of words, say 1 million, would this not be a problem? Is there a way around this? (See the sketch below.)
      Thanks,
      perl_seeker:)
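
      One way around the big array, following rev_1318's line-by-line suggestion above, is to never build @words at all and feed %count straight from the file. A minimal sketch, assuming whitespace-separated tokens, the tcorpus.txt filename, and the same punctuation set as the original script:

        use strict;
        use warnings;

        open my $fh, '<', 'tcorpus.txt' or die $!;
        my (%count, $total_words);
        while ( my $line = <$fh> ) {
            for my $word ( split ' ', $line ) {
                $word =~ tr/*()"'?,.!;-//d;   # strip the same punctuation set
                next unless length $word;     # skip tokens that were all punctuation
                $count{$word}++;              # per-word frequency
                $total_words++;               # running all-word total
            }
        }
        close $fh;

        my $uniq_words = keys %count;         # unique-word count
        print "total: $total_words, unique: $uniq_words\n";

      Memory then grows with the number of unique words rather than the total word count, so a million-word corpus should not be a problem.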