Analysing text files to obtain statistics on their content.

Davo1977 has asked for the wisdom of the Perl Monks concerning the following question:

You are to write a Perl program that analyses text files to obtain statistics on their content. The program should operate as follows:

1) When run, the program should check if an argument has been provided. If not, the program should prompt for, and accept input of, a filename from the keyboard.

2) The filename, either passed as an argument or input from the keyboard, should be checked to ensure it is in MS-DOS format. The filename part should be no longer than 8 characters and must begin with a letter or underscore character followed by up to 7 letters, digits or underscore characters. The file extension should be optional, but if given it should be ".TXT" (upper- or lowercase).

If no extension is given, ".TXT" should be added to the end of the filename. So, for example, if "testfile" is input as the filename, this should become "testfile.TXT". If "input.txt" is entered, this should remain unchanged.

3) If the filename provided is not of the correct format, the program should display a suitable error message and end at this point.

4) The program should then check to see if the file exists using the filename provided. If the file does not exist, a suitable error message should be displayed and the program should end at this point.

5) Next, if the file exists but the file is empty, again a suitable error message should be displayed and the program should end.

6) The file should be read and checked to display crude statistics on the number of characters, words, lines, sentences and paragraphs that are within the file.



I am very new to Perl and have managed to put this code together using examples from various books. Could anyone look over it and see how it could be improved?

#!/usr/bin/perl 

use strict; 
use warnings; 

if ($#ARGV == -1) #no filename provided as a command line argument. 
{ 
print("Please enter a filename: "); 
$filename = <STDIN>; 
chomp($filename); 
} 
else #got a filename as an argument. 
{ 
$filename = $ARGV[0]; 
} 

#perform the specified checks 
#check if filename is valid, exit if not 
if ($filename !~ m^/[a-z]{1,7}\.TXT$/i) 
{ 
die("File format not valid\n");) 
} 

if ($filename !~ m/\.TXT$/i) 
{ 
$filename .= ".TXT"; 
} 

#check if filename is actual file, exit if it is. 
if (-e $filename) 
{ 
die("File does not exist\n"); 
} 

#check if filename is empty, exit if it is. 
if (-s $filename) 
{ 
die("File is empty\n"); 
} 

my $i = 0; 
my $p = 1; 
my $words = 0; 
my $chars = 0; 

open(READFILE, "<$data1.txt") or die "Can't open file '$filename: $!";

#then use a while loop and series of if statements similar to the following
while (<READFILE>) { 
chomp; #removes the input record Separator 
$i = $.; #"$". is the input record line numbers, $i++ will also work 
$p++ if (m/^$/); #count paragraphs 
$my @t = split (/\s+/); #split sentences into "words" 
$words += @t; #add count to $words 
$chars += tr/ //c; #tr/ //c count all characters except spaces and add to $chars
} 


#display results 
print "There are $i lines in $data1\n"; 
print "There are $p Paragraphs in $data1\n"; 
print "There are $words in $data1\n"; 
print "There are $chars in $data1\n"; 

close(READFILE);

Re: Analysing text files to obtain statistics on their content. by kyle (Abbot) on Jun 25, 2008 at 17:10 UTC
Welcome to the Monastery. You might want to have a look at Writeup Formatting Tips.
This is a syntax error: `if ($filename !~ m^/[a-z]{1,7}\.TXT$/i)`
I think you meant this: `if ($filename !~ m/^[a-z]{1,7}\.TXT$/i)`
At the start, you store your input file's name in `$filename`, but when you open it, you expect the name to be in `$data1`.
This is a syntax error: `$my @t = split (/\s+/); #split sentences into "words"`
I think you meant: `my @t = split (/\s+/); #split sentences into "words"`
You could also have written that as just "`my @t = split`" because `/\s+/` is the default pattern.
Those are the show stoppers. There are a lot of stylistic concerns (indent!) that won't prevent it from working, but I'll stop here.
Update: Oops, I missed this syntax error: `die("File format not valid\n");)`
Just delete the trailing paren. You also have to declare (with `my`) `$filename` or `$data1` or both.
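The corrected pattern above compiles, but it still insists on an extension and allows only 1 to 7 letters. As a rough sketch of the rule the assignment actually describes (this is not kyle's code, and the helper name `normalise_filename` is made up for illustration), the check and the ".TXT" default could be combined like this:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the filename with ".TXT" appended if needed,
# or undef if the name does not fit the rule described in the assignment.
sub normalise_filename {
    my ($name) = @_;

    # One letter or underscore, then up to 7 letters, digits or underscores,
    # optionally followed by ".txt" in any mix of case.
    return undef unless $name =~ /\A[A-Za-z_][A-Za-z0-9_]{0,7}(\.txt)?\z/i;

    # $1 is undefined when the optional extension group did not match.
    return defined $1 ? $name : "$name.TXT";
}

for my $candidate (qw(testfile input.txt 9lives toolongname.txt notes.doc)) {
    my $result = normalise_filename($candidate);
    printf "%-16s => %s\n", $candidate, defined $result ? $result : 'invalid';
}
```

Doing the validation before appending the extension means a bare name such as "testfile" is accepted rather than rejected for lacking ".TXT".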
Re: Analysing text files to obtain statistics on their content. by toolic (Bishop) on Jun 25, 2008 at 17:09 UTC
Welcome to the Monastery. It is great that you used code tags to surround your code, but it would be better if you did not include the body of your question in code tags. The first thing you must do to improve your code is to get it to compile. It is good that you `use strict;`, but now you must declare all variables with `my`. The next thing to do is to improve your code's readability by using indentation.
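To illustrate the point about declaring variables, here is a minimal sketch of the argument-or-prompt step with every variable declared using `my`. The helper name `get_filename` is an assumption for illustration, and the demo reads from an in-memory string instead of the keyboard so that it runs unattended; a real script would pass `\@ARGV` and `\*STDIN`.

```perl
use strict;
use warnings;

# Hypothetical helper: use the first argument if one was given,
# otherwise prompt and read one line from the supplied input handle.
sub get_filename {
    my ($args, $in) = @_;

    return $args->[0] if @$args;

    print "Please enter a filename: ";
    my $name = <$in>;
    die "No filename given\n" unless defined $name;
    chomp $name;
    return $name;
}

# Stand-in for keyboard input so the sketch is self-contained.
my $typed_text = "notes\n";
open my $typed, '<', \$typed_text or die "Can't open in-memory handle: $!";

my $from_args   = get_filename(['report.txt'], $typed);
my $from_prompt = get_filename([], $typed);
print "\nFrom arguments: $from_args\nFrom prompt: $from_prompt\n";
```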
Re: Analysing text files to obtain statistics on their content. by moritz (Cardinal) on Jun 25, 2008 at 17:14 UTC
I assume you tested your script, and it works as expected. As to the matter of style, I strongly recommend that you indent your code properly, one tab (or 4 or 8 spaces or whatever) per opening bracket. That way you'll actually see what it's doing. Secondly, I suggest you use lexical variables as file handles, and use the three-argument form of open. See perlopentut for more details. See perlstyle for more tips on how to make your code readable. If you want more in-depth analysis of your script, install Perl::Critic; it might give you some helpful advice. Now I can only recommend that you read Writeup Formatting Tips and update your posting to use paragraphs for text, and code tags only for code. Helping with homework isn't much fun, btw.
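A minimal sketch of what the lexical filehandle and three-argument open might look like in the counting loop follows. It writes a small temporary file first so that it runs on its own. The file tests are written the way I read the assignment (`-e` is true when the file exists, `-z` when it has zero size), and the variable names are illustrative rather than taken from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small throwaway file so the sketch is self-contained.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "one two\nthree\n";
close $tmp_fh or die "Can't close '$filename': $!";

# Stop early if the file is missing or has no content.
die "File does not exist\n" unless -e $filename;
die "File is empty\n"       if     -z $filename;

# Three-argument open with a lexical filehandle instead of a bareword.
open my $in, '<', $filename or die "Can't open '$filename': $!";

my ($lines, $words, $chars) = (0, 0, 0);
while (my $line = <$in>) {
    chomp $line;
    $lines++;
    my @tokens = split ' ', $line;   # split on runs of whitespace
    $words += @tokens;
    $chars += length $line;          # characters, newline excluded
}
close $in or die "Can't close '$filename': $!";

print "Lines: $lines, words: $words, characters: $chars\n";
```

Keeping the mode (`'<'`) separate from the name means a filename containing characters such as `>` or `|` cannot change how the file is opened.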
Re: Analysing text files to obtain statistics on their content. by johngg (Canon) on Jun 25, 2008 at 19:43 UTC
Welcome to the Monastery. I've upvoted your post because you have made an effort to write some code and also do some research; don't worry too much about the downvotes, as formatting faux pas often receive a negative reaction. Please don't let this put you off the Monastery; keep visiting, as it is a great learning resource. Other Monks have given good advice and have covered the most important points. I would just like to draw your attention to a couple of further glitches. A minor point: you have forgotten to say what you are counting in the last two lines of your output :-) More seriously, your logic for counting paragraphs will fail if you have a trailing empty line (or lines) at the end of your data file. You should also consider how you wish to treat lines that just contain spaces and also multiple blank lines between paragraphs. I hope these thoughts are helpful. Cheers, JohnGG
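One possible way to handle those cases, offered as a sketch rather than the only answer: count a paragraph each time a line with visible text follows a blank (or whitespace-only) line or the start of the file. The helper name `count_paragraphs` and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a paragraph is a run of consecutive lines that each
# contain at least one non-whitespace character.
sub count_paragraphs {
    my @lines = @_;
    my $paragraphs   = 0;
    my $in_paragraph = 0;

    for my $line (@lines) {
        if ($line =~ /\S/) {
            # First line of a new paragraph.
            $paragraphs++ unless $in_paragraph;
            $in_paragraph = 1;
        }
        else {
            # Empty or whitespace-only line ends the current paragraph.
            $in_paragraph = 0;
        }
    }
    return $paragraphs;
}

# Two paragraphs, separated by an empty line and a spaces-only line,
# followed by trailing empty lines.
my @sample = ("First paragraph.", "Still first.", "", "   ", "Second.", "", "");
print "Paragraphs: ", count_paragraphs(@sample), "\n";
```

Because the count only goes up when text starts, trailing blank lines and repeated blank lines between paragraphs do not inflate the total.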
Re^2: Analysing text files to obtain statistics on their content. by Gavin (Archbishop) on Jun 26, 2008 at 09:32 UTC
I second that. Give Davo1977 a break; he's only been here a couple of days and has made an effort to format the post. We need to be encouraging new Perl programmers to the language and the Community, not discouraging them with negative XP at the slightest thing.