in reply to a question about making a word frequency matrix
I think you need three separate things for this.
Firstly, you have to parse words out of the text. This depends a bit on what you count as a word (which depends both on the language and your intent). Some months ago I posted an example at Re: stripped punctuation.
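Here's a minimal sketch of that step. What counts as a "word" here (lowercased runs of letters, apostrophes allowed) is just my assumption for the example, not the only reasonable choice:

    use strict;
    use warnings;

    sub words_of {
        my ($text) = @_;
        # lowercase, then pull out runs of letters and apostrophes
        return map { lc } $text =~ /([[:alpha:]']+)/g;
    }

    my @words = words_of("Don't count punctuation -- only words, words, words.");
    print "@words\n";   # don't count punctuation only words words words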
For the words-by-paragraph matrix, I think you'd need a hash-of-arrays structure. You can find some examples in perldoc perldsc.
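A rough sketch of that structure, assuming paragraphs are separated by blank lines: $count{$word}[$para] holds how many times $word occurs in paragraph number $para, and %total keeps the overall frequency for the next step.

    use strict;
    use warnings;

    my $text = do { local $/; <STDIN> };        # slurp the whole text
    my @paragraphs = split /\n\s*\n/, $text;    # blank-line separated

    my (%count, %total);
    for my $para (0 .. $#paragraphs) {
        for my $word ($paragraphs[$para] =~ /([[:alpha:]']+)/g) {
            $word = lc $word;
            $count{$word}[$para]++;
            $total{$word}++;                    # overall frequency, used below
        }
    }

    # e.g. one word's row across all paragraphs:
    # print join(" ", map { $_ // 0 } @{ $count{'words'} }), "\n";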
You also need to find out the 100 most frequent words. After you've calculated the frequency of every word, you have multiple ways to do this. With a longer text, there is a fast solution to find them at Re: Puzzle: The Ham Cheese Sandwich cut.. However, it's simpler and almost as fast to use a heap: insert the frequencies of the first 100 words into a heap, then, for each remaining word, insert its frequency and pop the smallest value from the heap. You could use one of the CPAN modules Heap and Heap::Simple (but remember, just because the name of a module is Simple or Light or Lite, it isn't necessarily simpler to use than other modules). Or you can adapt my script at Re: Re: Re: Re: Sorting values of nested hash refs, which is simpler than a generic heap module, as it doesn't include the algorithm to pop a value from a heap. Since then, I've written a better heap implementation which I'll have to post sometime. (Update: posted it, see Binary heap.) Using a heap has the advantage that it can give you the 100 most frequent words sorted by frequency.
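To illustrate the bounded-heap selection itself (not the code from the nodes or modules mentioned above), here is a sketch with a hand-rolled, array-based binary min-heap; %total stands in for the word => frequency hash built earlier, filled with random numbers just so the example runs on its own:

    use strict;
    use warnings;

    sub heap_push {
        my ($heap, $item) = @_;                 # $item = [ $freq, $word ]
        push @$heap, $item;
        my $i = $#$heap;
        while ($i > 0) {
            my $p = int(($i - 1) / 2);
            last if $heap->[$p][0] <= $heap->[$i][0];
            @$heap[$p, $i] = @$heap[$i, $p];    # sift the new item up
            $i = $p;
        }
    }

    sub heap_pop_min {
        my ($heap) = @_;
        my $min  = $heap->[0];
        my $last = pop @$heap;
        if (@$heap) {
            $heap->[0] = $last;
            my $i = 0;
            while (1) {                         # sift the moved item down
                my ($l, $r) = (2 * $i + 1, 2 * $i + 2);
                my $s = $i;
                $s = $l if $l < @$heap && $heap->[$l][0] < $heap->[$s][0];
                $s = $r if $r < @$heap && $heap->[$r][0] < $heap->[$s][0];
                last if $s == $i;
                @$heap[$s, $i] = @$heap[$i, $s];
                $i = $s;
            }
        }
        return $min;
    }

    my %total = map { $_ => 1 + int rand 1000 } 'aa' .. 'dz';   # stand-in frequencies

    my @heap;
    for my $word (keys %total) {
        heap_push(\@heap, [ $total{$word}, $word ]);
        heap_pop_min(\@heap) if @heap > 100;    # drop the least frequent candidate
    }

    # popping everything that's left gives the top 100, least frequent first
    my @top = map { $_->[1] } map { heap_pop_min(\@heap) } 1 .. @heap;
    print scalar(@top), " words kept\n";

The heap never holds more than 101 entries, so the memory cost stays small no matter how many distinct words the text has.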
However, the simplest solution is to sort all the different words by frequency (as I guess others will recommend). This might be slower than the other solutions but still not very slow, especially because sort is a built-in Perl function.
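That version fits in a couple of lines; %total is the same word => frequency hash as in the sketches above (a small literal hash here just to make it self-contained):

    use strict;
    use warnings;

    my %total = (the => 512, of => 301, and => 290, heap => 7, sandwich => 2);

    my @by_freq = sort { $total{$b} <=> $total{$a} } keys %total;
    my @top100  = splice @by_freq, 0, 100;   # splice copes with fewer than 100 words
    print "@top100\n";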