Apart from the very good comments above, I would add:

(1) This sort of process is usually better off without having specific file names hard-coded in the script. You can pass one or more command-line args for input files, or run some other command that prints text to STDOUT and pipe that to your script's STDIN. If your script prints results to STDOUT, you can either use redirection on the command line to create an output file (e.g., your_script.pl some_files*.txt > word_hist.txt) or pipe the output to some other process.

(2) GrandFather already pointed out a different sorting method, but I think it's better to sort numerically, and then format the numbers for output. (If you really want leading zeros in the output, that's fine and easy, but you don't need to do that just to sort the output.) Also, for sets of words that occur with the same frequency, it's often useful to have them listed in alphabetical order.
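As a minimal sketch of the ordering described in (2): sort on the numeric counts in descending order, fall back to alphabetical order for ties, and leave formatting to printf. The sample counts and the sub name by_freq_then_alpha are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample counts, for illustration only.
my %freq = ( the => 12, cat => 3, apple => 3, zebra => 100 );

# Highest count first; words with equal counts come out alphabetically.
sub by_freq_then_alpha {
    my ($counts) = @_;
    return sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts;
}

my @ordered = by_freq_then_alpha( \%freq );

# Formatting happens only at output time; the sort never sees padded strings.
printf "%5d %s\n", $freq{$_}, $_ for @ordered;
```

Because the comparison is numeric, 100 sorts ahead of 12 without any zero padding, and "apple" lands before "cat" since both occur 3 times.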

(3) The OP method of conditioning the text will work fine so long as your input data is always ASCII-only text, but if you happen to end up with data that contains things like "pie à la mode" or "naïve", your results will be inaccurate (à won't be counted at all, and naïve will be counted as two "words", na and ve). In this case, you need to know what character encoding is being used (utf8?, cp1252? something else?), and decode the input accordingly.
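To illustrate (3), here is a rough sketch that decodes raw bytes with the core Encode module before splitting them into words. The sub name words_from_bytes and the sample byte strings are assumptions for the demo, and the character class uses \p{L} (any Unicode letter) in place of an ASCII-only range.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode raw bytes using the named encoding, lowercase the result,
# and split it into words made of letters, apostrophes and hyphens.
sub words_from_bytes {
    my ( $bytes, $encoding ) = @_;
    my $text = lc( decode( $encoding, $bytes ) );
    $text =~ s/[^\p{L}'-]+/ /g;    # \p{L} matches any Unicode letter
    return split ' ', $text;
}

# The same phrase as it would arrive in two different encodings.
my $utf8_bytes   = "na\xC3\xAFve pie \xC3\xA0 la mode";    # UTF-8
my $cp1252_bytes = "na\xEFve pie \xE0 la mode";            # cp1252

my @from_utf8   = words_from_bytes( $utf8_bytes,   'UTF-8' );
my @from_cp1252 = words_from_bytes( $cp1252_bytes, 'cp1252' );

binmode STDOUT, ':encoding(UTF-8)';
print join( '|', @from_utf8 ), "\n";
```

Once the right encoding is named, both inputs yield the same five words, with "naïve" kept whole and "à" counted as a word of its own.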

Taking those points into account (and assuming utf8 as the most likely case for non-ASCII content):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use diagnostics;
    use open IN => ':utf8';
    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';

    my %freq;
    while (<>) {    # reads from STDIN or from all file names in @ARGV
        $_ = lc();
        # keep any Unicode letter (not just a-z), so "naïve" stays one word
        s/[^\p{L}'-]+/ /g;
        for my $word ( split ) {
            $freq{$word}++;
        }
    }

    # highest count first; ties are listed alphabetically
    for ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
        printf "%05d %s\n", $freq{$_}, $_;
        # or to list results on larger data sets without leading zeros:
        # printf "%9d %s\n", $freq{$_}, $_;
    }

(UPDATE: I was tempted to add a line or two inside the for my $word ( split ) loop, to remove initial and final apostrophes from each word - that's 'cuz some folks' typing habits include using apostrophes as single quotes - but sometimes an initial or final apostrophe should be treated as 'part of the word'. It's up to you how you want to handle that.)
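If you do decide to strip them, one possible way is sketched below. The sub name strip_outer_apostrophes and the sample words are hypothetical, and this version blindly removes every leading and trailing apostrophe, so a form like 'cuz loses its apostrophe too.

```perl
use strict;
use warnings;

# Remove apostrophes at the start and end of a word; inner ones stay.
sub strip_outer_apostrophes {
    my ($word) = @_;
    $word =~ s/^'+//;
    $word =~ s/'+$//;
    return $word;
}

# Hypothetical tokens as they might come out of split.
my @raw = ( "'quoted'", "don't", "'cuz", "folks'", "'" );

# A token made only of apostrophes becomes empty, so drop those.
my @cleaned = grep { length } map { strip_outer_apostrophes($_) } @raw;

print join( ' ', @cleaned ), "\n";
```

In the counting loop this would amount to cleaning $word and skipping it if nothing is left before incrementing the hash.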

In reply to Re: Code for generating a word frequency count not working by graff
in thread Code for generating a word frequency count not working by Pearl12345
