
The code I gave you was written on the fly on the command line rather than stored in a script file, so it needs a little modification to be used as a stored script. The enclosing single quotes around the code and the -E flag would go, and the command-line -M flags would be incorporated in the script as

use strict; use warnings;

lines at the top of your code. I use q{...} and qq{...} instead of '...' and "..." because it makes it easier to write code on the command line in both Unix/Linux and MS Windows environments, but they are functionally equivalent.
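As a quick check of that equivalence, here is a minimal sketch of a stored script. The variable names and the sample string are made up for illustration and are not from the original one-liner.

```perl
use strict;
use warnings;

my $name = 'world';    # hypothetical sample value

# q{...} behaves like single quotes: no variable interpolation.
my $single_a = 'Hello, $name';
my $single_b = q{Hello, $name};

# qq{...} behaves like double quotes: variables are interpolated.
my $double_a = "Hello, $name";
my $double_b = qq{Hello, $name};

print "single-quoted forms match\n" if $single_a eq $single_b;
print "double-quoted forms match\n" if $double_a eq $double_b;
```

On the command line the q{...} forms avoid clashing with the shell's own quoting, which is the only reason to prefer them there.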

Some points about your translation:-

You could employ a do block to get the total number of hits by changing

$words{$1} ++ while $text =~ m{$rxWords}g;

to

my $totalHits; do { $totalHits ++; $words{$1} ++; } while $text =~ m{$rxWords}g;

I'll leave you to see if you can work out how to get the total number of words given these clues: the regex pattern \b\w+\b and the g match modifier. Play around with some simple test text and see if you can solve the problem for yourself, then apply it to your real code. Doing is far and away the best way of learning!

I hope this helps you move forward.

Cheers,

JohnGG

Re^4: count number of overlapping words in a document
by dmarcel (Initiate) on Sep 18, 2014 at 08:39 UTC

    Thank you very much. This 'education' will certainly help me in the future, and I have made the adjustments you recommended. The code is now clear and I understand how it is set up. The last task was actually very simple in the end; I just did not realize I could use the regex for this task as well:

    my $totalwords; $totalwords ++ while $text =~ m{\b[A-Za-z]+\b}g;
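Note that [A-Za-z]+ and the suggested \w+ do not count the same things, because \w also matches digits and underscores. A small sketch of the difference, using a made-up sample string rather than the poster's data:

```perl
use strict;
use warnings;

my $text = 'Route 66 has 2 lanes';    # hypothetical sample text

# Count runs of ASCII letters only, as in the line above.
my $letters_only = 0;
$letters_only ++ while $text =~ m{\b[A-Za-z]+\b}g;

# Count runs of word characters (letters, digits, underscore).
my $word_chars = 0;
$word_chars ++ while $text =~ m{\b\w+\b}g;

print "letters only: $letters_only\n";
print "word characters: $word_chars\n";
```

Which of the two is the right "total number of words" depends on whether numbers should count as words in the documents being analysed.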

    I do have one small issue left to resolve. After adding the do-loop to count the total number of hits, I get the following warning: Use of uninitialized value $1 in hash element. The code still runs, but it adds one empty 'word' with a count of one to the final result, so the total always has a bias of +1, and I get many such warnings when I run it on multiple files. Is there a quick fix available?

    Update: solved the problem through a foreach loop
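For anyone hitting the same warning: a do { ... } while loop runs its body once before the condition is tested for the first time, so on that first pass no match has happened yet and $1 is still undefined. That accounts for both the empty key and the extra hit. The sketch below shows one possible fix using a plain while loop, which is not necessarily the foreach solution mentioned in the update (that code is not shown in the thread). The sample text and the $rxWords pattern are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the thread's real text and pattern.
my $text    = 'the cat sat on the mat with the hat';
my $rxWords = qr{\b(the|cat|hat)\b};

my %words;
my $totalHits = 0;

# The match is attempted before the body runs, so $1 is always
# defined inside the loop and no empty key is created.
while ( $text =~ m{$rxWords}g ) {
    $totalHits ++;
    $words{$1} ++;
}

print "total hits: $totalHits\n";
print "$_: $words{$_}\n" for sort keys %words;
```

With this form the per-word counts and the total agree, since every increment of $totalHits is paired with a successful match.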