Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have the following Perl script for manipulating a text file and measuring sentence length.

foreach $sentence(@sentences) {
    #print FILE "$sentence\n";
    @words = split(/[^\w'a-zA-Z0-9_'-?]/,$sentence);
    $Counter =0;
    foreach $word(@words){
        $Counter = $Counter+1;
        print ("$word\n");
    }
    $sentence_count{($Counter)} = $sentence_count{($Counter)}+1;
}
while (($sentence_count,$word_count) = each(%sentence_count)) {
    print ("There are $word_count sentences of $sentence_count words\n");
}

And for some reason it counts a slightly different number of words! But not like one less or one more, it's not at all consistent!! WHY oh WHY will it not count the numbers of words correctly??? Can anyone help me? Am I doing something REALLY dumb?? Katy M

Replies are listed 'Best First'.
Re: Sentence Measurer
by Beatnik (Parson) on Apr 11, 2001 at 22:17 UTC
    Lingua::EN::Sentence can split text into proper English sentences; measuring the length of each of those sentences would then get you what you asked for...

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Sentence Measurer
by mirod (Canon) on Apr 11, 2001 at 22:18 UTC

    You can simplify your code quite a bit:

    #!/bin/perl -w
    use strict;

    my %sentence_count;

    foreach my $sentence(<DATA>) {
        $sentence=~ s/^\W+//;                   # remove leading non-words
        my $counter = split(/\W+/,$sentence);   # split on non-words sequence (\W+)
                                                # in scalar context split will return
                                                # the number of elements in the generated list,
                                                # no need to count them ($count= @word in your
                                                # example would work too)
        $sentence_count{($counter)}++;
    }

    while ( my($sentence_count,$word_count) = each(%sentence_count)) {
        print ("There are $word_count sentences of $sentence_count words\n");
    }

    __DATA__
    one
    one
    two, two
    two. two
    two, two.
    three three three
    three three three
    three three three.
    three three, three
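
    A minimal sketch of why the leading non-word characters are stripped before splitting: when a line starts with separator characters, split /\W+/ returns an empty first field, which would inflate the count by one. The sample line below is hypothetical, not taken from the original data.

```perl
use strict;
use warnings;

my $line = "  two, two.\n";   # hypothetical input with leading spaces

# Without trimming: the leading run of non-word characters
# produces an empty first field.
my @raw = split /\W+/, $line;

# With the leading non-word characters removed first,
# only the real words remain.
(my $trimmed = $line) =~ s/^\W+//;
my @clean = split /\W+/, $trimmed;

print scalar(@raw),   " fields without trimming\n";
print scalar(@clean), " fields after trimming\n";
```

    Trailing separators are not a problem here, because split drops trailing empty fields by default.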
Re: Sentence Measurer
by larsen (Parson) on Apr 11, 2001 at 23:02 UTC
Re: Sentence Measurer
by twerq (Deacon) on Apr 11, 2001 at 22:20 UTC
    Something like split " ", $sentence should be sufficient for counting the words in a scalar.

    Try using this:
    my %sentence_count;
    my @sentences = (
        "Hello, how are you doing today?",
        "Where is the bathroom, pablo?",
        "My feet have the most beautiful odour!",
        "It's five o'clock"
    );

    foreach (@sentences) {
        $sentence_count{scalar(split " ",$_)}++;
    }

    foreach (keys %sentence_count) {
        print "$sentence_count{$_} sentences have $_ words\n";
    }
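
    The choice of separator changes the count for sentences with apostrophes. As a rough illustration, using one of the sample sentences from this reply: splitting on whitespace keeps "It's" and "o'clock" whole, while splitting on \W+ breaks them at the apostrophe.

```perl
use strict;
use warnings;

my $sentence = "It's five o'clock";

# split " " breaks on whitespace only, so contractions stay whole.
my @by_space = split " ", $sentence;

# split /\W+/ also breaks on the apostrophes.
my @by_nonword = split /\W+/, $sentence;

print scalar(@by_space),   " words by whitespace: @by_space\n";
print scalar(@by_nonword), " words by \\W+: @by_nonword\n";
```

    Which count is "right" depends on what should be treated as a word; the point is only that the two approaches can disagree on the same input.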
Re: Sentence Measurer
by suaveant (Parson) on Apr 11, 2001 at 22:14 UTC
    Well, for one, you could put a + after your character class, so you don't count an extra word when you have something like two spaces side by side...

    You have a lot of repetition in your character class: \w already covers a-zA-Z0-9_ (pretty sure), though that shouldn't prevent it from working. Really, what is wrong with splitting on whitespace, \s+?
    That should give you a decent count.
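
    A small sketch of the difference the + makes, using a hypothetical sentence with two spaces in a row. Without the +, each separator character is matched on its own, so two adjacent separators leave an empty field between them, and that empty field gets counted as a word.

```perl
use strict;
use warnings;

my $sentence = "Hello,  how are you?";   # note the two spaces after the comma

# Single-character separator: two adjacent separators produce an empty field.
my @loose = split /\s/, $sentence;

# One-or-more separator: a run of whitespace counts as a single break.
my @tight = split /\s+/, $sentence;

printf "split /\\s/  gives %d fields\n", scalar @loose;
printf "split /\\s+/ gives %d fields\n", scalar @tight;
```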

    as an aside, you can do $Counter++ to add one to counter, or even $Counter += 1;

                    - Ant

Re: Sentence Measurer
by c-era (Curate) on Apr 11, 2001 at 22:15 UTC
    It works for me, this is what I used:
    @sentences = ("one sentence","two sentences, but not realy","a tesing test.");
    foreach $sentence(@sentences) {
        #print FILE "$sentence\n";
        @words = split(/[^\w'a-zA-Z0-9_'-?]/,$sentence);
        $Counter =0;
        foreach $word(@words){
            $Counter = $Counter+1;
            print ("$word\n");
        }
        $sentence_count{($Counter)} = $sentence_count{($Counter)}+1;
    }
    while (($sentence_count,$word_count) = each(%sentence_count)) {
        print ("There are $word_count sentences of $sentence_count words\n");
    }
    Maybe you aren't getting the right thing in @sentences?
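
    One way to check what the split is actually producing is to print every field in brackets, so that empty strings become visible. The sketch below uses the separator class from the question unchanged; the sample sentence is hypothetical, chosen so that two separator characters ("!" followed by a space) sit next to each other. It is a possible explanation for the inconsistent counts, not a confirmed diagnosis of the original data.

```perl
use strict;
use warnings;

# The separator class from the question, copied unchanged.
my $sep = qr/[^\w'a-zA-Z0-9_'-?]/;

# Hypothetical sample input containing "! " (two separators in a row).
my $sentence = "Stop! Go now";

my @words = split $sep, $sentence;

# Bracket each field so that empty strings show up as [].
print join(" ", map { "[$_]" } @words), "\n";
print scalar(@words), " fields\n";
```

    Note also that '-? inside the character class is read as a range from ' to ?, which includes the comma, the period and the digits, so those characters are not treated as separators by this pattern.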