count number of overlapping words in a document

dmarcel has asked for the wisdom of the Perl Monks concerning the following question:

I am a beginning Perl user, so I'm sorry in advance if I miss some easy things, but I cannot get it to work and I am not sure whether every step is necessary (I have problems in particular with the while loops and 'saving' their outcome). I would like to do the following:

I have two text files: File 1 is a word list (one word per line) and File 2 is a regular text document with text and numbers. I would like to count the number of words in File 2 that also appear in File 1, and the total number of words in the document (so that I can calculate the percentage of corresponding words).

this is how far I have come:

use strict;
use warnings;

#Part one of the code, the wordlist is a file with one word per line and I transform this into a hash

    my $filename = "wordlist.txt";
    open(INPUT, $filename) or die "Cannot open $filename";

    my $line = <INPUT>;
    while($line = <INPUT>){
    chomp($line);
    my @words = split(/\s+/, $line);
    
my %unique = map {$_ => 1 } @words;
my @unique = %unique;

#part two of code, open the text file and extract words only (because the file also includes many numbers), and count the number of occurrences stored in a hash

open (DATA, "4.txt") or die;

my @UnNum;
my $x;
my %dict;

while (<DATA>) {
    chomp;
      $_ = lc;   
      s/ -- / /g; 
      s/ - / /g; 
      s/ +/ /g;   
      s/[.,:;?"!_()\[\]]//g; 
  
   my @UnNum = split(/\s+/);

foreach $x(@UnNum){
if ($x =~  /(([a-zA-Z']+-)*[a-zA-Z']+)/ ){
    ++$dict{$x};
}}}

#part 3 of the code, I try to compare the two different hashes and add the total number of occurrences

while ((my $words,my $number) = each (%dict))
{my $total+= $number; 
if (exists($unique{$words})){
my $corresponding +=$number;

print "There are $corresponding corresponding words of in total $total words";}}

}

thank you in advance for the help!


Re: count number of overlapping words in a document by johngg (Canon) on Sep 16, 2014 at 13:59 UTC
I'm not quite sure from your code exactly what you are aiming at, but I'd approach the task with a hash for the words in the first file. From there I'd construct a regex with capturing alternation of the keys of that hash, surrounded by word boundaries to avoid false hits. I'd then slurp the whole of the second file into a single variable and do a global regex match, incrementing the values of the hash when a match was found and captured in `$1`.

$ perl -Mstrict -Mwarnings -E '
open my $wordsFH, q{<}, \ <<EOF or die $!;
cat
dog
EOF

my %words = map { chomp; $_ => 0 } <$wordsFH>;
my $rxWords = do {
    local $" = q{ | };
    qr{(?x) \b ( @{ [ keys %words ] } ) \b };
};
say qq{Regex is $rxWords};

open my $textFH, q{<}, \ <<EOF or die $!;
The cat scattered doggerel words over
the poor dog as it doggedly ignored the
catastrophe the cat was causing
EOF

my $text = do { local $/; <$textFH>; };
$words{ $1 } ++ while $text =~ m{$rxWords}g;
say qq{$_ => $words{ $_ }} for sort keys %words;'
Regex is (?^u:(?x) \b ( dog | cat ) \b )
cat => 2
dog => 1
$

I hope this is helpful but ask further if I have misunderstood or anything is unclear.

Cheers,

JohnGG
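To see why the word boundaries matter, here is a minimal sketch that counts hits with and without `\b` on the same sample text. The sub name `count_hits` is made up for illustration, and the `quotemeta` call is an addition that is not in the code above; it is there only so that a word containing regex metacharacters would be matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count matches of any of
# @words in $text, with or without surrounding word boundaries.
sub count_hits {
    my ( $text, $use_boundaries, @words ) = @_;

    # Escape each word so it is matched literally, then join with "|".
    my $alternation = join '|', map { quotemeta($_) } @words;
    my $rx = $use_boundaries
        ? qr{\b(?:$alternation)\b}
        : qr{(?:$alternation)};

    # Assigning the match list to () in scalar context yields the match count.
    my $hits = () = $text =~ /$rx/g;
    return $hits;
}

my $text = 'The cat scattered doggerel words over the poor dog as it '
         . 'doggedly ignored the catastrophe the cat was causing';

print 'Without boundaries: ', count_hits( $text, 0, qw(cat dog) ), "\n";
print 'With boundaries:    ', count_hits( $text, 1, qw(cat dog) ), "\n";
```

Without the boundaries, "scattered", "catastrophe", "doggerel" and "doggedly" all count as hits; with them, only the standalone words do.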
Re^2: count number of overlapping words in a document by dmarcel (Initiate) on Sep 16, 2014 at 18:08 UTC
Dear JohnGG,

Thank you very much for your prompt reply! I really appreciate it! I did run into some small problems because my computer did not recognize the code (it gave many errors). After some research I found out that this is because I am using Windows (I feel even more of a beginner now for not figuring that out). As a result I had some difficulties understanding the code, because I am used to a completely different way of writing code (some lines are still difficult for me to understand, but I get what you're doing).

Could you maybe check whether this 'translation' is correct? When I replace the first line with use strict and use warnings, it gives me 2 errors later in the code.

$ perl -Mstrict -Mwarnings -E ;
my $wordsFH = "woord.txt";
open(INPUT, $wordsFH) or die "Cannot open $filename";
my %words = map {chomp; $_ => 0 } <INPUT>;
my $rxWords = do {
    local $" = q{ | };
    qr{(?x) \b ( @{ [ keys %words ] } ) \b };
};
print "Regex is $rxWords";
my $textFH = "4.txt";
open (texting, $textFH) or die;
my $text = do { local $/; <texting>; };
$words{$1} ++ while $text =~ m{$rxWords}g;
print "$_ => $words{ $_ }}" for sort keys %words;

Furthermore, I still have 2 problems that I need to tackle and I am not sure how to handle them:

1. Is it possible to extract the total number of 'hits'? `$total ++ while $text =~ m{$rxWords}g;` works, but can I incorporate it so that I do not have to run the regex twice?
2. Is there a simple addition to calculate the total number of words? (Is setting up an array with chop the easiest solution?)

Again, thank you so much for your time!
Re^3: count number of overlapping words in a document by johngg (Canon) on Sep 17, 2014 at 11:44 UTC
The code I gave you was written on the fly on the command line rather than being stored in a script file. Therefore it would require a little modification to be used as a stored script. The enclosing single quotes around the code and the `-E` flag would go, and the command line `-M` flags would be incorporated in the script as `use strict; use warnings;` lines at the top of your code. I use `q{...}` and `qq{...}` instead of `'...'` and `"..."` because it makes it easier to write code on the command line in both Unix/Linux and MS Windows environments, but they are functionally equivalent.

Some points about your translation:

- Be consistent with indenting: indent code that falls within inner logical scopes so that all code at the same logical level starts in the same column. Indent continuation lines as well, but by a different amount from logical indentation to avoid confusion.
- Use the three-argument form of open and lexical rather than package filehandles and, when you die on failure, show the o/s error message held in `$!` as well; see perlvar.
- My normal practice is to use camelCase for my identifiers and, when applied to files that I am opening, I would use, say, `$wordsFile` to hold the name of the file I want to open, `$wordsFH` for the lexical filehandle I open it against and `<$wordsFH>` to read it.
- You might have a typo: is your file called "woords.txt" or "words.txt", and where did you get `$filename` from?
- Lose the second closing brace in the last line; it is not an error, but you forgot to remove it when you converted to double-quotes and it will make your output look untidy.

You could employ a do block to get the total number of hits by changing

$words{$1} ++ while $text =~ m{$rxWords}g;

to

my $totalHits;
do {
    $totalHits ++;
    $words{$1} ++;
} while $text =~ m{$rxWords}g;

I'll leave you to see if you can work out how to get the total number of words given these clues: the regex pattern `\b\w+\b` and the `g` match modifier.

Play around with some simple test text and see if you can solve the problem for yourself, then apply it to your real code. Doing is far and away the best way of learning!

I hope this helps you move forward.

Cheers,

JohnGG
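One thing to watch with the `do { ... } while` form: Perl runs the block once before it tests the condition, so the counters are bumped one time before any match has been attempted. A plain `while` loop with a block avoids that. The sketch below puts the pieces together in script-file style, with the total word count done via the `\b\w+\b` clue. It is one possible reading of the advice above, not JohnGG's code. The sub name `count_overlap`, the `quotemeta` and `grep` calls, and the in-memory sample data are assumptions for illustration; with real files you would open `$wordsFile` and the text file by name using the same three-argument form.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes two open filehandles
# (word list, text) and returns (hits, total words, per-word counts).
sub count_overlap {
    my ( $wordsFH, $textFH ) = @_;

    # One word per line; blank lines are skipped.
    my %words = map { chomp; $_ => 0 } grep { /\S/ } <$wordsFH>;
    my $alternation = join '|', map { quotemeta($_) } keys %words;
    my $rxWords = qr{\b($alternation)\b};    # case-sensitive match

    my $text = do { local $/; <$textFH> };   # slurp the whole text

    my $totalHits = 0;
    while ( $text =~ m{$rxWords}g ) {        # condition is tested before the body runs
        $totalHits++;
        $words{$1}++;
    }

    # Note: \w also matches digits and underscore, so numbers count as words here.
    my $totalWords = () = $text =~ m{\b\w+\b}g;

    return ( $totalHits, $totalWords, \%words );
}

# Sample data held in memory so the sketch runs without files on disk.
my $wordList = "cat\ndog\n";
my $sample   = "The cat scattered doggerel words over the poor dog as it "
             . "doggedly ignored the catastrophe the cat was causing\n";

open my $wordsFH, q{<}, \$wordList or die "Cannot open word list: $!";
open my $textFH,  q{<}, \$sample   or die "Cannot open text: $!";

my ( $hits, $total, $counts ) = count_overlap( $wordsFH, $textFH );

printf "%d of %d words (%.1f%%) are in the word list\n",
    $hits, $total, $total ? 100 * $hits / $total : 0;
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```

Because the same loop updates both `$totalHits` and `%words`, the word-list regex only has to run over the text once; the total word count is a separate, much simpler match.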
Re^4: count number of overlapping words in a document by dmarcel (Initiate) on Sep 18, 2014 at 08:39 UTC