Extracting common words

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: Extracting common words
by Athanasius (Cardinal) on Oct 22, 2015 at 03:41 UTC

The first while loop is working correctly: it extracts the words from the first text file, removes non-alphanumeric characters, converts the words to lower case, and stores them as the keys in a hash. (Here, as often in Perl, the value stored with the key is irrelevant.) So far, so good. (Note: the line $line =~ s/[[:punct:]]//g; isn’t needed, because all punctuation characters are removed in the subsequent substitution: $word =~ s/[^A-Za-z0-9]//g;.)

The second while loop is more complicated, and that’s where things fall apart. I don’t think I follow the intended logic here; but, rather than try to fix it, it will be simpler — and clearer — to re-think the algorithm.

The easiest approach is to simply repeat the logic of the first while loop and create a second hash containing the words in the second text file. (It will help if you rename the first hash %results to something like %words1, then the second hash can be named %words2.) You now have only to find which keys are common to both hashes, and that will give you the desired result:

my $counter = 0;

for my $key (sort keys %words1)
{
    if (exists $words2{$key})
    {
        ++$counter;
        print $key, "\n\n";
    }
}

print "Found $counter words in common\n";
[download]

You may also benefit from studying the FAQs in Data: Hashes (Associative Arrays).

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re^2: Extracting common words

by AppleFritter (Vicar) on Oct 22, 2015 at 09:34 UTC

The easiest approach is to simply repeat the logic of the first while loop and create a second hash containing the words in the second text file. (It will help if you rename the first hash %results to something like %words1, then the second hash can be named %words2.) You now have only to find which keys are common to both hashes, and that will give you the desired result:

This is good advice; there's also List::Compare::Functional, which may make it easier to get the intersection of the two, e.g. (untested):

use feature qw/say/;
use List::Compare::Functional qw/get_intersection/;

# ...

my @common  = get_intersection( [keys %words1], [keys %words2] );
my $counter = scalar @common;

# output common words
say join "\n\n", @common;

# output count;
say "Found $counter words in common."
[download]

EDIT: of course it's $counter, not $common; thanks Athanasius. (See what I meant about "untested"?)

[reply]
[d/l]

Re: Extracting common words
by GrandFather (Saint) on Oct 22, 2015 at 04:03 UTC

Reworking your code slightly to use "built in" files results are as expected:

#!/usr/bin/perl
use strict;
use warnings;

=pod Removed original file name code to make sample self contained.

my $f1 = shift;
my $f2 = shift;

if (! defined($f1) or ! defined($f2)) {
    die "Need two text file names as arguments.  \n";
}

=cut

my $file1Content = <<CONTENT;
red green blue red orange
CONTENT

my $file2Content = <<CONTENT;
yellow orange
red grey purple
CONTENT

my %results;

open my $file1, '<', \$file1Content;
while (my $line = <$file1>) {
    $line =~ s/[[:punct:]]//g;

    for my $word (split(/\s+/, $line)) {
        $word =~ s/[^A-Za-z0-9]//g;
        $results{lc $word} = 1;
    }
}


my @words2;
my @storage;

open my $file2, '<', \$file2Content;

while (my $line = <$file2>) {
    $line =~ s/[[:punct:]]/ /g;
    @words2 = grep {/\S/} split(/ /, $line);
    for (my $i = 0; $i < scalar @words2; $i++) {
        $words2[$i] = lc($words2[$i]);
        $words2[$i] =~ s/[^A-Za-z0-9]//g;

        push(@storage, $words2[$i]);
        if (grep {$_ eq $words2[$i]} @storage[0 .. $#storage - 1]) {
            $results{$words2[$i]} = 1;
        } else {
            $results{$words2[$i]}++;
        }
    }
}

my $counter = 0;
foreach my $words (sort {$results{$b} <=> $results{$a}} keys %results)
+ {
    if ($results{$words} > 1) {
        $counter = $counter + 1;
        print $words, "\n\n";
    }
}

printf "Found %1.0f words in common\n", $counter;
[download]

Prints:

orange

red

Found 2 words in common
[download]

Maybe you can provide "file contents" that fail in the way you didn't describe?

Of course, the code can be cleaned up a little:

#!/usr/bin/perl
use strict;
use warnings;

my $file1Content = <<CONTENT;
red green blue red orange
CONTENT

my $file2Content = <<CONTENT;
yellow orange
red grey purple
CONTENT

my %group1;

open my $file1, '<', \$file1Content;
while (my $line = <$file1>) {
    my @words = map {lc} grep {$_} split /[\W\d]+/, $line;
    
    $group1{$_} = $_ for @words;
}


my %common;

open my $file2, '<', \$file2Content;

while (my $line = <$file2>) {
    my @words = map {lc} grep {/\S/} split /[\W\d]+/, $line;
    
    $common{$_} = $_ for grep {exists $group1{$_}} @words;
}

print "$_\n\n" for sort values %common;
printf "Found %1.0f words in common\n", scalar keys %common;
[download]

Premature optimization is the root of all job security

[reply]
[d/l]
[select]

Re: Extracting common words
by Anonymous Monk on Oct 22, 2015 at 05:15 UTC

#!/usr/bin/perl

# http://perlmonks.org/?node_id=1145607

use File::Slurp;
use strict;
use warnings;

my $f1 = shift;
my $f2 = shift;

my (%hash1, %hash2);
$hash1{$_} = 1 for lc(read_file $f1) =~ /[A-Za-z0-9]+/g;
$hash2{$_} = 1 for lc(read_file $f2) =~ /[A-Za-z0-9]+/g;

my $count = my @common = sort grep $hash2{$_}, keys %hash1;
print "count: $count\n@common\n";
[download]

[reply]
[d/l]

Re^2: Extracting common words

by Anonymous Monk on Oct 22, 2015 at 16:22 UTC

$script =~ s/A-Z//g;

[reply]
[d/l]

Re: Extracting common words
by Old_Gray_Bear (Bishop) on Oct 22, 2015 at 22:41 UTC

open file1
loop over file 1
    extract each $word
    $results{$word} = 1
    end-loop
close file1

open file2
loop over file2
    extract each $word
    $results{$word" += 2 #### Note: '+=', not '='
    end-loop
close file2

##
### %results contains all of the words in both files.
### Words that appear in both files will have a value of '3'.
### Iterate over the %results keys and print only the
###   when the value is '3'.
##
[download]

----
I Go Back to Sleep, Now.

OGB

[reply]
[d/l]

Re^2: Extracting common words

by BrowserUk (Patriarch) on Oct 22, 2015 at 23:08 UTC

## ### %results contains all of the words in both files. ### Words that appear in both files will have a value of '3'. ### Iterate over the %results keys and print only the ### when the value is '3'. ##
[download]

If a word appears once in the first file and twice in the second, the value will 5; if it appears 3 times in the second file ...

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]

Re^2: Extracting common words

by Anonymous Monk on Oct 22, 2015 at 23:39 UTC

$results{$word" |= 2 #### Note:  '|=', not '+=' and not '='
[download]

[reply]
[d/l]

Re: Extracting common words
by Anonymous Monk on Oct 22, 2015 at 02:05 UTC

use autodie;

[reply]