Dict - Compare

drno has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm a new Perl user and have tried to get this script http://www.perlmonks.org/index.pl?node_id=288692 from allolex working correctly. When I invoke the dictionary option, it returns only the last word in the dictionary text (or nothing). When I invoke the glossary option, it returns everything in both the dictionary and the glossary. Can someone confirm that it works as described, or point me to a possible fix? I'm running Perl on XP and have gzip and zlib installed. Thanks in advance for any help! Todd

#!/usr/bin/perl

#  POD can be found at the bottom of this script

use strict;
use warnings;
use Compress::Zlib;
use Getopt::Long;
use Pod::Usage;

my $VERSION = 0.81;
my $dictfile = 'dict.gz';

#  Process command-line options

my %cl_options = (
    help              => '',
    version           => '',
    token_debug       => '',
    glossary_output   => '',
    dictionary_output => ''
);

GetOptions(
        'help|?'      => \$cl_options{help},
        'version'     => \$cl_options{version},
        'man'         => \$cl_options{man},
        'token-debug' => \$cl_options{token_debug}, 
        'glossary'    => \$cl_options{glossary_output},
        'dictionary'  => \$cl_options{dictionary_output}    
);

print "This is version $VERSION of $0.\n" if $cl_options{version};
exit(0) if ($cl_options{version});
pod2usage(-exitstatus => 0, -verbose => 1, -msg => "Help for $0")
    if $cl_options{help};
pod2usage(-exitstatus => 0, -verbose => 2, -msg => "Man page for $0")
    if $cl_options{man};

my $file = shift;
my %dictionary = readdict(\$dictfile);
my %glossary;

findwords();

printlexicon(\%dictionary) if $cl_options{dictionary_output};
printlexicon(\%glossary) if $cl_options{glossary_output};


#  Readdict reads in the dictionary file defined above using
#  the Compress::Zlib CPAN module.  It returns a hash that is
#  used for all further dictionary operations.
#
sub readdict {
    my $dict = shift;
    my %dicthash;

    my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno\n";
    while ($gz->gzreadline($_) > 0) {
        chomp;
        $dicthash{lc($_)} = 0;
    }
    die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREAM_END;
    return %dicthash;
}

#  findwords() reads in a file and compares words found in the file
#  with the contents of the dictionary read in by the readdict
#  function.  It assigns counts to the elements of %dictionary and
#  creates %glossary elements and increases its values according to
#  the number of matches.

sub findwords {
    #  Use "or", not "||": "||" binds to $file, so the die would never fire
    open my $if, "<", $file or die "Could not open $file: $!";
    while (<$if>) {
        chomp;
        my @elements = split(/[ '-]/,$_); # split on hyphens, too
        foreach my $element (@elements) {
            next if $element =~ /\d/; #  Don't need digits
            print "[$element]->" if $cl_options{token_debug};
            $element = lc($element);
            $element =~ s/[\s,!?._;«»)("'-]//g;
            print "[$element]\n" if $cl_options{token_debug};
            next if $element eq '';
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            } else {
                $glossary{$element}++;
            }
        }
    }
}

#  printlexicon() reads in a lexicon hash via a reference and prints out
#  all words that have been seen in the findwords() function along with
#  a frequency count.
#
sub printlexicon {
    my $lexicon = shift;
    my $counter = 0;
    foreach my $key (sort keys %$lexicon) {
        if ( $$lexicon{$key} > 0 ) {
            print $key . " : " . $$lexicon{$key} . "\n";
            $counter++;
        }
    }
    print "\n$counter entries total\n";
}

__END__

=pod

=head1 dict-compare

A generic script for building dictionaries by comparing them to real-world texts.

=head1 DESCRIPTION

This program compares the words in a given text file to a list of words from
a dictionary file.  It is capable of outputting lists of words that occur or
do not occur in a given dictionary file, along with their frequency in the
text.  Debugging output using token tag marks is also available.

=head1 SYNOPSIS

C<< dict-compare [--glossary --dictionary] [--token-debug] file > output_file >>

=head2 OPTIONS 

=over 12

=item C<--help,-h,-?>

Prints a usage help screen.

=item C<--man,-m>

Prints out the manual entry for $0

=item C<--version,-v>

Prints out the program version.

=item C<--glossary>

Prints a glossary of words not found in the dictionary file and the number of
times they occur.

=item C<--dictionary>

Prints out the words from the text that had a dictionary match, along with
their respective frequencies.

=item C<--token-debug>

Prints tags around each token in the text to help sound out strange tokens.
The tokens themselves are printed side-by-side to show how the script cleans
up the results.

=back

=head1 EXAMPLE

C<dict-compare --glossary myfile.txt>

This command reads in the text contained in myfile.txt and prints out a list
of words not found in the dictionary and their frequencies.

=head1 DICTIONARY FORMAT

The dictionary is a one-word-per-line file that has been gzipped.  Your
dictionary can be anything.  Think of the possibilities.

=head1 THANKS

The following people have reviewed and offered improvements to this code:

=over 12

=item B<Sauoq> L<http://www.perlmonks.org/index.pl?node_id=182681>

=item B<adjelore> L<http://www.perlmonks.org/index.pl?node_id=131479>

=item B<Hutta> L<http://www.perlmonks.org/index.pl?node_id=117788>

=item B<TomDLux> L<http://www.perlmonks.org/index.pl?node_id=144696>

=item B<Not_A_Number> L<http://www.perlmonks.org/index.pl?node_id=258724>

=back

And of course all of the others at the Monastery and Cologne.pm, whose help
can only be seen in its cumulative effect.

=head1 AUTHOR

Damon "allolex" Davison - <allolex@sdf.freeshell.org>

=head1 LICENSE

This code is released under the same terms as Perl itself.

=cut

Replies are listed 'Best First'.
Re: Dict - Compare by toolic (Bishop) on Mar 26, 2010 at 23:21 UTC
Here is the link to that node: dict-compare: a dictionary evaluation script. There are syntax errors in the code you pasted because your copy-and-paste did not go cleanly (the "+" continuation characters were inadvertently copied). See also: Linking on PerlMonks
Re^2: Dict - Compare by drno (Initiate) on Mar 29, 2010 at 16:32 UTC
Thanks for the reply. For the post I just copied and pasted directly from the link, but I did remove the "+"s when I ran the script. So it still isn't working correctly.
Re: Dict - Compare by toolic (Bishop) on Mar 27, 2010 at 18:59 UTC
It seems to work as advertised for me, but I tried it on Linux.

    $ cp /usr/share/dict/words ./dict
    $ gzip dict
    $ cat ./file.txt
    big dog bgfdsrt
    $ dict-compare -dictionary ./file.txt
    big : 1
    dog : 1
    $ dict-compare -glossary ./file.txt
    bgfdsrt : 1

Where did you get your `dict.gz` file from? Start adding print statements throughout the code (see Basic debugging checklist).
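
As a rough sketch of the kind of print statement that can help here: dictionary keys that look identical on screen may still differ by an invisible character. The `visible` helper below is hypothetical (it is not part of dict-compare), and the sample hash only stands in for whatever `readdict()` actually returns on your machine.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of dict-compare): show every character
# outside printable ASCII as a \x{..} escape so stray bytes become visible.
sub visible {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7e])/sprintf('\\x{%02x}', ord $1)/ge;
    return $text;
}

# Sample keys standing in for the hash returned by readdict().
my %dictionary = ( "big\r" => 0, "dog" => 0 );

# Bracket each key so leading or trailing debris is easy to spot.
for my $key (sort keys %dictionary) {
    print '[', visible($key), "]\n";
}
```

In the real script, the same loop could be placed right after the call to `readdict()`.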
Re^2: Dict - Compare by drno (Initiate) on Mar 29, 2010 at 16:41 UTC
Thanks toolic. I just gzipped a text file in the one-word-per-line format stated in the POD at the bottom of the script:

    =head1 DICTIONARY FORMAT

    The dictionary is a one-word-per-line file that has been gzipped. Your dictionary can be anything. Think of the possibilities.

It seems like dict-compare is only recognizing the last word in the dictionary (with the -dictionary option). Is there something else I should be doing concerning the dict.gz file?
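
One possible explanation, offered as an assumption that nothing in this thread confirms: if the word list was saved with Windows (CRLF) line endings before being gzipped, `chomp` in `readdict()` removes only the `\n` and leaves a `\r` on every key except a last line that has no terminator. That would match the "last word or nothing" symptom. The sketch below simulates this with in-memory lines; `strip_eol` is a hypothetical replacement for the `chomp`, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical replacement for the chomp in readdict(): remove a trailing
# LF and, if present, the CR that precedes it in CRLF-terminated files.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Simulated raw lines from a CRLF word list whose last line is unterminated.
my @raw = ("cat\r\n", "big\r\n", "dog");

my (%with_chomp, %with_strip);
for my $line (@raw) {
    my $copy = $line;
    chomp $copy;                              # leaves "\r" behind on CRLF lines
    $with_chomp{ lc $copy } = 0;
    $with_strip{ lc strip_eol($line) } = 0;
}

# Report which plain words each variant would recognize.
for my $word (qw(cat big dog)) {
    printf "%-4s chomp: %-3s strip_eol: %s\n", $word,
        exists $with_chomp{$word} ? 'yes' : 'no',
        exists $with_strip{$word} ? 'yes' : 'no';
}
```

If this is the cause, either re-save the word list with LF line endings before gzipping it, or strip the `\r` when reading.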
Re^3: Dict - Compare by toolic (Bishop) on Mar 29, 2010 at 16:52 UTC
As I mentioned, the script works for me. Therefore, I suspect there is a problem with your dict.gz file. That is why I suggested using print in your code. It does not sound like you have attempted to debug your problem yet. Another suggestion is to create a trivial dict file yourself with just 3 words in it:

    cat
    big
    dog

Then, run the exact commands that I showed. You should get the output I posted.
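
On a system without a convenient `gzip` command line, the trivial dictionary can also be written from Perl itself. This is a minimal sketch using the same Compress::Zlib module the script already loads; the scratch directory and variable names are illustrative choices, and the read-back loop only mirrors what `readdict()` does.

```perl
use strict;
use warnings;
use Compress::Zlib;
use File::Temp qw(tempdir);

# Work in a scratch directory so an existing dict.gz is not overwritten.
my $dir      = tempdir(CLEANUP => 1);
my $dictfile = "$dir/dict.gz";

# Write three words, one per line, with plain LF line endings.
my $out = gzopen($dictfile, "wb") or die "Cannot open $dictfile: $gzerrno\n";
$out->gzwrite("$_\n") for qw(cat big dog);
$out->gzclose();

# Read the file back the same way readdict() does.
my @words;
my $in = gzopen($dictfile, "rb") or die "Cannot open $dictfile: $gzerrno\n";
my $line;
while ($in->gzreadline($line) > 0) {
    chomp $line;
    push @words, $line;
}
$in->gzclose();

print scalar(@words), " words read: @words\n";
```

To use the result with dict-compare, write to `dict.gz` in the script's working directory instead of the scratch path.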
Re^4: Dict - Compare by drno (Initiate) on Mar 29, 2010 at 17:41 UTC