Code for generating a word frequency count not working

Pearl12345 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Code for generating a word frequency count not working
by Athanasius (Archbishop) on Nov 24, 2014 at 02:05 UTC

Hello Pearl12345, and welcome to the Monastery!

You open two filehandles, TEXT for reading and OUT for writing; but then you use only the first. The warning/diagnostic (not error) message is alerting you to the fact that you open OUT but never use it. Most likely, you intended the second-last line to be:

print OUT join("\n", @finalarray);

BTW, it’s good practice to use strict; and to declare all variables as lexicals (i.e., using my); also, to use lexical variables for filehandles, and to use the three-argument form of open:

use strict;
use warnings;
use diagnostics;
open(my $text, '<', "C:/Users/Customer/Desktop/New folder/Perl/1dfre10.TXT");
open(my $out,  '>', "C:/Users/Customer/Desktop/New folder/Perl/1dfre10.OUT");
undef($/);
my $all_text = <$text>;
$all_text = lc($all_text);
$all_text =~ s/[^a-z\-\']/ /g;
my @wordarray = split(/[\n\s]+/, $all_text);
...

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: Code for generating a word frequency count not working

by Pearl12345 (Initiate) on Nov 24, 2014 at 02:27 UTC

Okay thanks it worked!


Re: Code for generating a word frequency count not working
by GrandFather (Saint) on Nov 24, 2014 at 02:11 UTC

Athanasius beat me to the draw for your immediate issue and the good style advice. He missed suggesting you should check your opens however.

Perl provides opportunities to tidy up your code a little. Consider:

#!/usr/local/bin/perl
use strict;
use warnings;
use diagnostics;

my $path = 'C:/Users/Customer/Desktop/New folder/Perl/';
my $inName = '1dfre10.TXT';
my $outName = '1dfre10.OUT';

open my $in, '<', "$path$inName" or die "Can't open '$inName': $!\n";

my $all_text = lc do {local $/; <$in>};
my %freq;

$all_text =~ s/[^a-z\-\']/ /g;
++$freq{$_} for split /[\n\s]+/, $all_text;

open my $out, '>', "$path$outName" or die "Can't create '$outName': $!\n";
print $out join "\n", reverse sort map {sprintf "%05d $_", $freq{$_}} keys %freq;

Untested, but should provoke a little thought.

Perl is the programming world's equivalent of English

Re: Code for generating a word frequency count not working
by graff (Chancellor) on Nov 24, 2014 at 05:36 UTC

(1) This sort of process is usually better off without having specific file names hard-coded in the script. You can use one or more command-line args for input files, or run some other command that prints text to stdout and pipe that to your script's STDIN. If your script prints results to STDOUT, you can either use redirection on the command line to create an output file (e.g., your_script.pl some_files*.txt > word_hist.txt) or pipe the output to some other process.

(2) GrandFather already pointed out a different sorting method, but I think it's better to sort numerically, and then format the numbers for output. (If you really want leading zeros in the output, that's fine and easy, but you don't need to do that just to sort the output.) Also, for sets of words that occur with the same frequency, it's often useful to have them listed in alphabetical order.

(3) The OP method of conditioning the text will work fine so long as your input data is always ASCII-only text, but if you happen to end up with data that contains things like "pie à la mode" or "naïve", your results will be inaccurate (à won't be counted at all, and naïve will be counted as two "words", na and ve). In this case, you need to know what character encoding is being used (utf8?, cp1252? something else?), and decode the input accordingly.

Taking those points into account (and assuming utf8 as the most likely case for non-ASCII content):

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

use open IN => ':utf8';
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';

my %freq;
while (<>) {  # reads from STDIN or from all file names in @ARGV
    $_ = lc();
    s/[^a-z'-]+/ /g;
    for my $word ( split ) {
        $freq{$word}++;
    }
}
for ( sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq ) {
    printf "%05d %s\n", $freq{$_}, $_;
    # or to list results on larger data sets without leading zeros:
    #  printf "%9d %s\n", $freq{$_}, $_;
}
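
Point (3) can be seen in a small sketch. It assumes the text has already been decoded into Perl characters, and it compares the ASCII-only character class with a Unicode-aware one. The helper name words_with, the sample string, and the choice of \p{L} as the "letter" class are assumptions made for illustration, not something prescribed in this thread:

```perl
use strict;
use warnings;
use utf8;    # this sketch's own source contains non-ASCII literals

# Hypothetical helper: lower-case the text, replace every run of characters
# matched by $strip with a space, and return the remaining words.
sub words_with {
    my ($text, $strip) = @_;
    (my $clean = lc $text) =~ s/$strip/ /g;
    return split ' ', $clean;
}

my $sample  = "Naïve pie à la mode";
my @ascii   = words_with($sample, qr/[^a-z'-]+/);       # ASCII letters only
my @unicode = words_with($sample, qr/[^\p{L}'-]+/);     # any Unicode letter

binmode STDOUT, ':encoding(UTF-8)';
print "ASCII-only class: @ascii\n";      # na ve pie la mode
print "Unicode-aware:    @unicode\n";    # naïve pie à la mode
```

With the ASCII-only class, "naïve" splits into "na" and "ve" and "à" disappears, as described above. Whether \p{L} is the right definition of a word character depends on the data.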



Re: Code for generating a word frequency count not working
by vinoth.ree (Monsignor) on Nov 24, 2014 at 05:56 UTC

Hi Pearl12345,

For more information about your message (which is a warning, not an error), here is a description of Perl's warnings, with examples similar to the warning you received.

Warnings

The most important tool for writing good Perl is the 'warnings' flag, the -w command line switch. You can turn on warnings by placing -w on the first line of your programs like so:

#!/usr/local/bin/perl -w

Or, if you're running a program from the command line, you can use -w there, as in perl -w myprogram.pl.

Turning on warnings will make Perl yelp and complain at a huge variety of things that are almost always sources of bugs in your programs. Perl normally takes a relaxed attitude toward things that may be problems; it assumes that you know what you're doing, even when you don't.

Here's an example of a program that Perl will be perfectly happy to run without blinking, even though it has an error on almost every line!

#!/usr/local/bin/perl
$filename = "./logfile.txt";
open (LOG, $fn);
print LOG "Test\n";
close LOGFILE;

Now, add the -w switch to the first line, and run it again. You should see something like this:

Name "main::filename" used only once: possible typo at ./a6-warn.pl line 3.
Name "main::LOGFILE" used only once: possible typo at ./a6-warn.pl line 6.
Name "main::fn" used only once: possible typo at ./a6-warn.pl line 4.
Use of uninitialized value at ./a6-warn.pl line 4.
print on closed filehandle main::LOG at ./a6-warn.pl line 5.

Here's what each of these warnings means:

1. Name "main::filename" used only once: possible typo at ./a6-warn.pl line 3. and Name "main::fn" used only once: possible typo at ./a6-warn.pl line 4. Perl notices that $filename and $fn are each used only once, and guesses that you've misspelled or misnamed one or the other. A name that appears only once is almost always the result of a typo or bug in your code, like using $filenmae instead of $filename, or using $filename throughout your program except for one place where you use $fn.

2. Name "main::LOGFILE" used only once: possible typo at ./a6-warn.pl line 6. In the same way that we made our $filename typo, we mixed up the names of our filehandles: We use LOG for the filehandle while we're writing the log entry, but we try to close LOGFILE instead.

3. Use of uninitialized value at ./a6-warn.pl line 4. This is one of Perl's more cryptic complaints, but it's not difficult to fix. This means that you're trying to use a variable before you've assigned a value to it, and that is almost always an error. When we first mentioned $fn in our program, it hadn't been given a value yet. You can avoid this type of warning by always setting a default value for a variable before you first use it.

4. print on closed filehandle main::LOG at ./a6-warn.pl line 5. We didn't successfully open LOG, because $fn was empty. When Perl sees that we are trying to print something to the LOG filehandle, it would normally just ignore it and assume that we know what we're doing. But when -w is enabled, Perl warns us that it suspects there's something afoot.

So, how do we fix these warnings? The first step, obviously, is to fix these problems in our script. (And while we're at it, I deliberately violated our rule of always checking if open() succeeded! Let's fix that, too.) This turns it into:

#!/usr/local/bin/perl -w
$filename = "./logfile.txt";
open (LOG, $filename) or die "Couldn't open $filename: $!";
print LOG "Test\n";
close LOG;

Now, we run our corrected program, and get this back from it:

Filehandle main::LOG opened only for input at ./a6-warn2.pl line 5.

Where did this error come from? Look at our open(). Since we're not preceding the filename with > or >>, Perl opens the file for reading, but in the next line we're trying to write to it with a print. Perl will normally let this pass, but when warnings are in place, it alerts you to possible problems. Change line 4 to this instead and everything will be great:

open (LOG, ">>$filename") or die "Couldn't open $filename: $!";

The -w flag is your friend. Keep it on at all times. You may also want to read the perldiag man page, which contains a listing of all the various messages (including warnings) Perl will spit out when it encounters a problem. Each message is accompanied by a detailed description of what the message means and how to fix it.

Source from: Perl.com

All is well


Re^2: Code for generating a word frequency count not working

by GrandFather (Saint) on Nov 24, 2014 at 07:23 UTC

Note that use warnings; is the preferred way of turning warnings on in modern Perl, along with lexical file handles, three parameter open and use strict;. In particular, don't use -w on the command line or in the shebang line at the start of the script.

Actually, you should always use both strictures (use strict; use warnings; - see The strictures, according to Seuss), not just warnings.
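
As a rough sketch of that style, the log-file example from the parent post might be rewritten along these lines. The file name and the variable names are hypothetical, chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical file name, chosen only for this sketch.
my $filename = 'logfile.txt';

# Three-argument open with a lexical filehandle; '>>' opens for appending.
open my $log, '>>', $filename or die "Couldn't open '$filename': $!\n";
print {$log} "Test\n";
close $log or die "Couldn't close '$filename': $!\n";
```

Because $log is a lexical declared under strict, a misspelled handle name becomes a compile-time error rather than a "used only once" warning.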

If you were paying attention you would see that the OP was already using warnings and that the other style recommendations were made and illustrated in replies.

Update: oh, and you can link to Perl documentation using [doc://perldiag] - see Linking on PerlMonks.

Perl is the programming world's equivalent of English


Re^2: Code for generating a word frequency count not working

by Laurent_R (Canon) on Nov 24, 2014 at 07:28 UTC

Rather than the -w switch, the preferred way to enable warnings in modern Perl is the pragma:

use warnings;

Update: Beaten by a few minutes by GrandFather on this one. Corrected a typo.
