Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Here is a weird project I'm trying to do: how can you open a text file, read out all the unique words, and print the number of occurrences of each word? For example, the file contains: I used to be. I had the answers to everything. But now I know, life doesn't always go my way. Output:
I: 3 Used: 1 To: 2 Be: 1 ....
What I think I have to do is read the entire file into an array then print out the words from that, then use a foreach on whitespace to $count++ but from there I wouldn't know how to combine the two. Any suggestions other than leading me to the two other posts on this site about it (different project) would be helpful!

Re: Word Counting (contractions)
by tye (Sage) on Apr 24, 2003 at 21:37 UTC

    I don't find splitting on spaces to be very good at picking out "words". If you just want to count the total number of words, then it works pretty well. But for your task, I find the quite simple:

    @words= $line =~ /(\w+(?:'\w+)?)/g;
    to be much more effective. It isn't perfect. If you have numbers and/or underscores in your text and you want to ignore them and/or you want to handle non-English letters, then a better version is:
    @words= $line =~ /([[:alpha:]]+(?:'[[:alpha:]]+)?)/g;

    These match the common contractions (like "don't", "isn't", "aren't", and "I've" that I've used) but aren't bothered by 'quoting'.

                    - tye

    Update: Even better, allow hyphenated-word matching:

    @words= $line =~ /([[:alpha:]]+(?:[-'][[:alpha:]]+)*)/g;
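
    A minimal sketch of how the last pattern behaves when it feeds a hash of counts. The sample line, the helper name words_in, and the %count hash are made up for illustration and do not come from the posts above.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample line with a contraction, a quoted word and a hyphenated word.
    my $line = "I don't like 'quoted' well-known words, do I?";

    # Return every word the hyphen/apostrophe-aware pattern finds in the text.
    sub words_in {
        my ($text) = @_;
        return $text =~ /([[:alpha:]]+(?:[-'][[:alpha:]]+)*)/g;
    }

    # One hash entry per distinct word, incremented each time the word is seen.
    my %count;
    $count{$_}++ for words_in($line);

    print "$_: $count{$_}\n" for sort keys %count;
    ```

    Here "don't" and "well-known" each survive as a single word, while the quotes around 'quoted' are dropped because an apostrophe only counts when letters sit on both sides of it.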
      So far I can get the total word count set by using the code below. From there I don't know how to place each word into their own hash and count per word rather than count the total number. I knew from the beginning I had to use a regex for this to be more accurate and I think I'll go with the last one you posted but using that confuses me. Would I have to change my for loop to a foreach (keys @words) { @words =~ ...} ?
      my $file = "test.txt";
      my $count = "0";
      open (FILE, $file) or die "Error $!";
      my $words = <FILE>;
      $count++ for split /\s+/, $words;
      print "Count: $count\n";
      print "Words: $words\n";
      close FILE;
        I suspect a few people may be giving up in head-banging-on-the-desk frustration by now..., so I'll give it a shot. :)

        You're so nearly there...you just need to replace a single line there with the line that perlplexer already gave you, which loops over the list from 'split' (using 'for' or 'foreach' - they're synonymous, so you use whichever one you think looks nicer - don't you love a language that takes aesthetics into account?) and increments the value in a corresponding 'words' hash. I'm not going to tell you though which line that is I'm afraid, 'cos methinks the student protesteth overmuch about this not being homework...I prescribe a healthy dose of Camel :)

        Cheers,
        Ben

Re: Word Counting
by perlplexer (Hermit) on Apr 24, 2003 at 20:44 UTC
    Wow... when is this homework due? ;)

    1. Open the file
    2. Read it line by line
    3. Split each line into words (on spaces)
    4. Add each word to a hash (this will ensure uniqueness)
    5. When all lines are processed, print hash keys
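
    The steps above can be sketched roughly as follows. This is only one way to do it: the sub name count_words and the sample text are assumptions, and an in-memory string stands in for the real file so the example runs on its own.

    ```perl
    use strict;
    use warnings;

    # Steps 2 to 4: read line by line, split into words, add each word to a hash.
    sub count_words {
        my ($fh) = @_;
        my %seen;
        while (my $line = <$fh>) {
            # split ' ' splits on runs of whitespace and skips leading whitespace
            $seen{$_}++ for split ' ', $line;
        }
        return \%seen;
    }

    # Step 1: open the file. A string is opened as a file here; with a real
    # file the first line would be: open my $fh, '<', 'test.txt' or die ...
    my $text = "the cat sat\non the mat\n";
    open my $fh, '<', \$text or die "Cannot open: $!";
    my $seen = count_words($fh);
    close $fh;

    # Step 5: print the hash keys, here together with their counts.
    print "$_: $seen->{$_}\n" for sort keys %$seen;
    ```

    Because a hash can hold each key only once, uniqueness comes for free, and the value stored under each key doubles as the per-word count.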


    --perlplexer
      I would recommend $wordhash{lc $word}++; as opposed to $wordhash{$word}++; if you're not worried about case-sensitivity, so you don't end up with:

      To: 50
      to: 23

      Also, if you're going to split on \s+ as everyone seems to be suggesting, don't forget you're going to have to $word =~ s/\W//g to get rid of those unsightly punctuation marks.
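
      A small sketch of both suggestions together, using a made-up sample line. Note one side effect, shown in the comment: s/\W//g also removes the apostrophe inside contractions.

      ```perl
      use strict;
      use warnings;

      my %wordhash;
      my $line = "To be, or not to be: that doesn't matter.";   # hypothetical input

      for my $word (split /\s+/, $line) {
          $word =~ s/\W//g;        # strips punctuation; "doesn't" becomes "doesnt"
          next if $word eq '';     # skip tokens that were punctuation only
          $wordhash{lc $word}++;   # "To" and "to" share one entry
      }

      print "$_: $wordhash{$_}\n" for sort keys %wordhash;
      ```

      If contractions should stay intact, matching words with a regex instead of splitting and stripping avoids the problem.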

      --
      perl: code of the samurai

      This definitely isn't homework, heh...sometimes I wish I were still in school though. I know how to open and read files, but I don't understand part #3. How do I split on whitespace? You'd have to split each word into its own variable, that's the part I don't understand.
        You can use split to put the words into an array or a list:
        my $line = "This is a line of text";
        my @words = split /\s+/, $line;
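        Note that if the line is already in $_, the arguments aren't needed at all, because a bare split splits $_ on whitespace. To get that default behaviour on $line, which also skips leading whitespace, write split ' ', $line.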

        There's more info on split on Perldoc.
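
        One subtlety that may be worth checking on your own data: splitting on /\s+/ and splitting on ' ' differ when the line starts with whitespace. A short sketch, with a made-up line and variable names:

        ```perl
        use strict;
        use warnings;

        my $line = "  This is a line of text";   # note the leading spaces

        my @by_regex = split /\s+/, $line;   # keeps an empty leading field
        my @by_space = split ' ',   $line;   # skips leading whitespace

        print scalar(@by_regex), " fields with /\\s+/\n";
        print scalar(@by_space), " fields with ' '\n";
        ```

        The extra empty field from /\s+/ would show up as an empty-string key in a word-count hash, so ' ' is usually the safer choice for indented text.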

        Hope that helps,
        -- Foxcub
        A friend is someone who can see straight through you, yet still enjoy the view. (Anon)

        Use split(); e.g.,
        # Assuming %words is the hash where you keep all the words
        $words{$_}++ for split /\s+/, $line;
        --perlplexer
Re: Word Counting
by hacker (Priest) on Apr 24, 2003 at 23:38 UTC
    I'll give this a try, this should get you going. How about:
    my %count;
    my $size = 4;    # minimum size of words to count
    my ($final_count, $final_length) = (0, 0);

    while (<>) {
        my @words = split;
        foreach my $word (@words) {
            next if length($word) < $size;
            # Count only "real" words, and words with
            # hyphens and apostrophes, like /foo-bar/
            # and /don't/ and /ma'am/
            $count{$word}++
                if $word =~ m/^[a-zA-Z]+(?:[-'][a-zA-Z]+)*$/;
        }
    }

    # An empty hash is false, so this catches input with no countable words
    die "No words in the file\n" unless %count;

    foreach my $word (keys %count) {
        $final_count  += $count{$word};
        $final_length += length($word);
    }

    # Do some other stuff with your hash

    Update: Fixed a small typo

      I was wondering where $word gets its value in your code. I set @words to <FILE> and the script ran without any known errors, but it kept saying there weren't any words in the file (as per your die). I tossed in a print $_ after I set up @words to <FILE> and it printed out the document, so we know there's nothing wrong with accessing and reading that file.

      I think it has to do with $words not having a value and from your script I can't tell what I'm supposed to give it. One thing I noticed was $count($word) kept on giving me errors so I had to change it to $count{$word}, that could also be where the error resides.

      I don't really see how perlplexer's line will solve my problem of putting each word into a hash and counting them, but for the sake of it I'll replace that line in my script.

      Thanks.

Re: Word Counting
by The Mad Hatter (Priest) on Apr 24, 2003 at 21:00 UTC
    Update: I see you've elaborated above with what you don't know how to do. That's good.

    Boy, this sure sounds like homework. Whatever the case, and especially if it is homework, I'd suggest you at least try and write it yourself (it really isn't hard) by following perlplexer's steps. If you still can't manage to do it, come back with code to show and explain what you've tried. Then maybe you'll get better responses.