Statistics from a txtfile

mbdc566021 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Statistics from a txtfile by suaveant (Parson) on Dec 28, 2007 at 18:38 UTC
You probably want a regexp more like `/^[_a-zA-Z][^.]{1,7}\.txt$/`. This matches an underscore or letter first (the {1} is unnecessary: a character class always matches exactly one character, which is why you need quantifiers like * or + to make it match more), then 1 to 7 characters that are not a dot, then a literal .txt at the end of the name.

Then you can do `while(<READFILE>) { # something here }`, which will go through the file line by line, putting each line in $_. I would read up on chomp, split and regular expressions to see how to work with the data in the file.

P.S. Put <code> tags around your code to make it display properly in a post.

- Ant
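A minimal sketch of the two ideas above, for experimenting. The sub names `looks_valid` and `count_lines_and_words` are made up for illustration, and the sample text is read from an in-memory filehandle so that nothing depends on a real file; treat it as a starting point under those assumptions rather than a finished answer.

```perl
use strict;
use warnings;

# The pattern from the post: one leading letter or underscore, then
# 1 to 7 characters that are not a dot, then a literal ".txt".
sub looks_valid {
    my ($name) = @_;
    return $name =~ /^[_a-zA-Z][^.]{1,7}\.txt$/ ? 1 : 0;
}

# Hypothetical helper: read a filehandle line by line and count
# lines and whitespace-separated words.
sub count_lines_and_words {
    my ($fh) = @_;
    my ($lines, $words) = (0, 0);
    while (my $line = <$fh>) {
        chomp $line;                      # drop the trailing newline
        my @fields = split ' ', $line;    # split on runs of whitespace
        $lines++;
        $words += @fields;                # an array in numeric context gives its size
    }
    return ($lines, $words);
}

# Demo on an in-memory "file" so the sketch runs without any input file.
my $sample = "one two three\nfour five\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my ($lines, $words) = count_lines_and_words($fh);
close $fh;
print "lines=$lines words=$words\n";
```

Note that the pattern is case sensitive as written; adding the `/i` modifier would also accept names ending in .TXT.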
Re^2: Statistics from a txtfile by mbdc566021 (Initiate) on Jan 02, 2008 at 18:27 UTC
I have read up on regex and am still confused. Am I anywhere near correct with the variables?

    #!/usr/bin/perl
    print("Please enter filename: ");
    $filename = <STDIN>;
    chomp ($filename);
    if ($filename=~m/^[_a-zA-Z][^.]{1,7}\.txt$/)
    {
        open (READFILE, $filename) || die "Failed to open $filename: $!";
    }
    my $filecontents;
    {
        local undef $/;
        $filecontents = <READFILE>;
    }
    close <READFILE>; #slurps the whole file into variable
    my @characters = $filecontents =~ m/./g; # puts a copy of each match into @characters
    my $CharCount = scalar @characters; # this counts the number of elements in @characters
    my @words = $filecontents =~m/\b\s/g; # number of words
    my @paragraph =($ # number of paragraphs. what is code for new line char ie carriage return?
    my @sentences = $filecontents=~m/\.$/g; # number of sentences
    for (@characters){ print "$_ \n"; } # this will print a list of each item counted:
    # output data
    # open(OUT, ">data1.txt") || die "data1.txt not open: $!";
    # close(OUT);
Re: Statistics from a txtfile by cdarke (Prior) on Dec 28, 2007 at 19:11 UTC
A few other things:

When you read the filename from STDIN it will include a trailing newline character, so get rid of it with `chomp $filename;`.

When reporting an error from an open (or any other system-related function), include the system error held in $!, so you know why it failed: `open (READFILE, $filename) || die "Failed to open $filename: $!";`

There is plenty more to do, but I suggest you get some basic code working, then embellish it.
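Those two points can be sketched together as follows. The helper names `clean_name` and `open_for_reading` are hypothetical, and the existence and emptiness checks are an assumption based on the requirements stated elsewhere in this thread; the three-argument `open` with a lexical filehandle is a common modern alternative to the bareword form above.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy up a filename as typed at a prompt.
sub clean_name {
    my ($name) = @_;
    chomp $name;                                  # remove the trailing newline left by <STDIN>
    $name .= '.txt' unless $name =~ /\.txt$/i;    # append the extension if it is missing
    return $name;
}

# Hypothetical helper: open a file for reading and say why if it fails.
sub open_for_reading {
    my ($name) = @_;
    die "$name does not exist\n" unless -e $name;   # -e: file exists
    die "$name is empty\n"       unless -s $name;   # -s: file has non-zero size
    open my $fh, '<', $name or die "Failed to open $name: $!";
    return $fh;
}

# Simulate what <STDIN> would return, newline included.
print clean_name("report\n"), "\n";
```

In a real script the argument to `clean_name` would come from `<STDIN>`, and the returned name would be passed to `open_for_reading`.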
Re: Statistics from a txtfile by hangon (Deacon) on Dec 28, 2007 at 21:51 UTC
To get the statistics you want, it may be easier to slurp the whole file into a variable and then process it through a series of regexes. Here's the basic idea, but as suaveant suggests, read up on regular expressions:

    open (READFILE, $filename) || die "Failed to open $filename: $!";
    my $filecontents;
    {
        local $/;    # undefine the input record separator inside this block only
        $filecontents = <READFILE>;
    }
    close READFILE;
    my @words = $filecontents =~ / ... /g;
    my $wordcount = scalar @words;
    my @characters = ...
    my @sentences = ...
    # etc
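One possible way to fill in that outline is sketched below. The regexes are deliberately naive guesses, not the patterns elided in the post: a word is taken to be a run of `\w` characters and a sentence to end at a run of `.`, `!` or `?`. The sub names `slurp` and `text_stats` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($name) = @_;
    open my $fh, '<', $name or die "Failed to open $name: $!";
    local $/;                  # no record separator: one read returns everything
    my $contents = <$fh>;
    close $fh;
    return $contents;
}

# Hypothetical helper: simple counts for a string of text.
sub text_stats {
    my ($text) = @_;
    my @characters = $text =~ /./gs;        # /s lets . match newlines as well
    my @words      = $text =~ /\b\w+\b/g;   # runs of word characters
    my @sentences  = $text =~ /[.!?]+/g;    # runs of end-of-sentence punctuation
    return (
        chars     => scalar @characters,
        words     => scalar @words,
        sentences => scalar @sentences,
    );
}

my %stats = text_stats("Hello world. How are you?\n");
print "$_: $stats{$_}\n" for sort keys %stats;
```

With a real file, `text_stats(slurp($filename))` would combine the two. Apostrophes and abbreviations will throw these counts off, so the patterns need adjusting to whatever definition of word and sentence the assignment uses.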
Re^2: Statistics from a txtfile by mbdc566021 (Initiate) on Dec 30, 2007 at 15:49 UTC
Hi there, I have reviewed my code and have put it into some sort of structure. The problem I was set was to:

- ask for a filename
- check the filename to see if it is in MS-DOS file format: it should be no longer than 8 characters, should begin with an underscore or letter, and should end with .txt (not case sensitive); if it does not end with .txt, the program should add it
- check whether the file exists and is not empty
- read the file by character and get the following statistics: character count including whitespace and punctuation, number of words, paragraphs, lines and sentences
- output the details to a separate .txt file

    #!/usr/bin/perl
    print("Please enter filename: ");
    $filename = <STDIN>;
    chomp ($filename);
    if ($filename=~m/^[_a-zA-Z][^.]{1,7}\.txt$/)
    open (READFILE, $filename) || die "Failed to open $filename: $!";
    my $filecontents;
    {
        local undef $/;
        $filecontents = <READFILE>;
    }
    while (<READFILE>)
    my @characters = ($filecontents =~m/\b/g);
    my @words = ($filecontents =~m/\b\s/g);
    my $wordcount = scalar @words;
    my @paragraph =($
    my @sentences = ($filecontents=~m/\.$/);
    $CharCount{ $characters }++;
    $wordcount{ $wordcount)++;
    etc
    close <READFILE>;
    open(OUT, ">data1.txt") || die "data1.txt not open: $!";
    output data here
    close(OUT);

This is as far as I have got. Could you please elaborate on my coding further? Thank you kindly.
Re^3: Statistics from a txtfile by hangon (Deacon) on Dec 31, 2007 at 19:44 UTC
Hi mbdc566021,

Your code looks like it has just been cobbled together with snippets from various sources. This is OK, but don't be afraid to make changes and experiment with the code to see what happens. Here are a few points:

- You do not need the while loop, which incidentally is not complete anyway. In this context while(<READFILE>) is used to read the file line by line, but since you have already slurped the file, all of it has been read into the $filecontents variable.
- Incrementing a hash as suggested by apl, `$CharCount{$characters}++`, only works inside a loop structure. You are trying to combine two different techniques.
- You will need to experiment with your regexes to get the matches you want. If you have a copy of Programming Perl, study the chapter on Pattern Matching, and/or check the Tutorials section of PerlMonks.

Also, see the annotated code below:

    # this matches each character once
    $filecontents =~ m/./g;

    # this version also puts a copy of each match into @characters
    my @characters = $filecontents =~ m/./g;

    # this counts the number of elements in @characters
    my $CharCount = scalar @characters;

    # to verify your regex is matching correctly
    # this will print a list of each item counted:
    for (@characters){ print "$_ \n"; }

Good luck with your assignment.
Re^4: Statistics from a txtfile by blazar (Canon) on Jan 02, 2008 at 21:17 UTC
Re^5: Statistics from a txtfile by hangon (Deacon) on Jan 03, 2008 at 03:36 UTC
Re: Statistics from a txtfile by apl (Monsignor) on Dec 28, 2007 at 20:35 UTC
"character count including whitespace and punctuation": For each character in a line, increment a hash keyed on that character (i.e. `$CharCount{ $ch }++;`).

"number of words. paragraphs. lines and sentences.": How would you determine the end of a word, a paragraph, or a sentence? Increment the appropriate counter when you hit that situation. How do you determine that you've read a line?
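The hash-per-character idea can be tried out on a short string first. This is only a sketch: the sub name `char_frequency` and the sample text are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each character occurs in $text.
sub char_frequency {
    my ($text) = @_;
    my %count;
    $count{$_}++ for split //, $text;   # split // yields one element per character
    return %count;
}

my %freq = char_frequency("hello world\n");
for my $ch (sort keys %freq) {
    # Give the invisible characters a readable label.
    my $label = $ch eq "\n" ? '\n' : $ch eq ' ' ? 'space' : $ch;
    print "$label: $freq{$ch}\n";
}
```

Adding up the values of the hash gives the total character count, whitespace and punctuation included.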
Re: Statistics from a txtfile by ww (Archbishop) on Dec 28, 2007 at 22:43 UTC
Smells like homework... so, the nudge is:

- Read Learning Perl
- Read Perl Cookbook (stats answers can be found here)
- Mark homework as such when it is
Re^2: Statistics from a txtfile by mbdc566021 (Initiate) on Jan 07, 2008 at 19:17 UTC
Dear Sirs, I have knuckled down and made some good groundwork with the structure. I have managed to accept a given file name if certain values are met. I have also managed to count the sentences, but can I add the character count, paragraph count and word count from within the same WHILE loop? I am also confused as to how to count paragraphs. Is there a code for carriage returns?

    #!/usr/bin/perl
    if ($#ARGV == -1)
    {
        print("Please enter filename: ");
        $filename = <STDIN>;
        chomp ($filename);
    }
    else
    {
        $filename = $ARGV[0];
    }
    if ($filename -r && $filename=~m/^[_a-zA-Z]/) #if filename is readable AND matches.....
    {
        open (READFILE, $filename) || die "Failed to open $filename: $!";
    }
    if ($filename !~ m/\.txt$/i) #if filename does not end with .txt then add to filename
    {
        $filename .= ".TXT";
    }
    my $filecontents;
    {
        local undef $/;
        $filecontents = <READFILE>;
    }
    close <READFILE>; #slurps the whole file into a variable
    $sentences = 0;
    my @characters = $filecontents =~ m/./g; # puts a copy of each match into @characters
    my($ch);
    my $CharCount = scalar @characters; # this counts the number of elements in @characters
    my @words = $filecontents =~m/[a-zA-Z]\s/g;# matches a char followed by a white space character globally
    while ($ch = getc(READFILE))
    {
        # count sentences:
        if ($ch eq "?" || $ch eq "!" || $ch eq ".") # if character is one of the three end of sentence markers
        {
            $sentences++;
        }
    }
    while ($ch = getc(READFILE))
    {
        $CharCount { $ch }++;
    }
    for (@characters)
    {
        print "There are $_ \n characters";
    }
    print "There are $sentences sentences";
    # output data
    # open(OUT, ">data1.txt") || die "data1.txt not open: $!";
    # close(OUT);
Re^3: Statistics from a txtfile by ww (Archbishop) on Jan 07, 2008 at 22:37 UTC
Re your question, "Is there a code for carriage returns?" Yes, `\n` (strictly that is the newline; a literal carriage return is `\r`). That's pretty basic, but whether or how a carriage return defines a paragraph, in a grammatical sense, is another question. Some definitions of a paragraph construe the combination of a period followed by a carriage return and a second <CR> in the next, otherwise empty, line as a paragraph indicator. But others might consider any line beginning with indentation (e.g. leading spaces or a tab) greater than that of the previous line as a paragraph indicator... and if you wish to stretch a bit, some plain text might invite the interpretation that any <CR> marks a paragraph end. How are you defining a paragraph?

Similarly, your test for sentences, `if ($ch eq "?" || $ch eq "!" || $ch eq ".")`, is incomplete because it fails to allow for the possibility that the sentence may contain an abbreviated word or words: `"Mr. John Doe, Jr. is a Sr. programmer for E.H.I., Inc. Miss Laura J. Smith is a Analyst for ABC. "` How many sentences are there? By inspection, I'm sure you'll agree, there are two. But your test for sentences will give you a much higher sentence count.

As to the rest of your logic and syntax: note that your code won't compile. Running `perl -c yourcode` is a good idea before posting :-) as is (if this is the problem) double-checking that what you've posted matches what you thought you posted. Even with all the syntactical issues fixed, I can't make your code extract the values you assert you've obtained. Regrettably, I've run out of time and ambition to identify/clarify/correct all of those, but they're issues of higher precedence than your hope to do all the counting in a single `while` clause. At this point, I have to suspect that producing this relied as much on cutting and pasting snippets from hither and yon as on study and comprehension.

Note, however, that among other things you're trying to use `getc` on a filehandle that isn't open, and that the use you would be making of getc if the filehandle were open would be redundant (you've already read all the chars; why not read them from `$filecontents`?) and... (/me trails off in dismay...) Perhaps you'll get some inspiration from the Perl Cookbook (for instance chapter 8, and 8.2 especially).
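To see both points in action, here is a rough sketch that works on a string such as `$filecontents` rather than on `getc`. It assumes just one of the possible paragraph definitions discussed above (blocks of text separated by at least one blank line), and the sub names are hypothetical.

```perl
use strict;
use warnings;

# Assumption: paragraphs are blocks of text separated by at least one blank line.
sub count_paragraphs {
    my ($text) = @_;
    my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;   # drop chunks with no visible text
    return scalar @paragraphs;
}

# Naive count: every ".", "!" or "?" is treated as the end of a sentence.
sub count_sentence_marks {
    my ($text) = @_;
    my $count = () = $text =~ /[.!?]/g;   # "= () =" forces list context, then counts the matches
    return $count;
}

my $sample = "Mr. John Doe, Jr. is a Sr. programmer for E.H.I., Inc. "
           . "Miss Laura J. Smith is a Analyst for ABC. ";

# The naive count comes out far higher than the two real sentences.
print "sentence marks: ", count_sentence_marks($sample), "\n";
print "paragraphs: ",
    count_paragraphs("First paragraph.\nStill first.\n\nSecond paragraph.\n"), "\n";
```

A more careful sentence counter would need some rule for abbreviations, for example only counting a mark that is followed by whitespace and a capital letter, and even that rule misfires on text like the sample above.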