drno has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm a new Perl user and have tried to get this script http://www.perlmonks.org/index.pl?node_id=288692 from allolex working correctly. When I invoke the dictionary option, it returns only the last word in the dictionary text (or nothing). When I invoke the glossary option, it returns everything in both the dictionary and glossary. Can someone confirm that it works as described or point me to a possible fix? I'm running Perl on XP and have gzip and zlib installed. Thanks in advance for any help! Todd

#!/usr/bin/perl
# POD can be found at the bottom of this script

use strict;
use warnings;
use Compress::Zlib;
use Getopt::Long;
use Pod::Usage;

my $VERSION = 0.81;
my $dictfile = 'dict.gz';

# Process command-line options
my %cl_options = (
    help              => '',
    version           => '',
    token_debug       => '',
    glossary_output   => '',
    dictionary_output => ''
);

GetOptions(
    'help|?'      => \$cl_options{help},
    'version'     => \$cl_options{version},
    'man'         => \$cl_options{man},
    'token-debug' => \$cl_options{token_debug},
    'glossary'    => \$cl_options{glossary_output},
    'dictionary'  => \$cl_options{dictionary_output}
);

print "This is version $VERSION of $0.\n" if $cl_options{version};
exit(0) if ($cl_options{version});
pod2usage(-exitstatus => 0, -verbose => 1, -msg => "Help for $0") if $cl_options{help};
pod2usage(-exitstatus => 0, -verbose => 2, -msg => "Man page for $0") if $cl_options{man};

my $file = shift;
my %dictionary = readdict(\$dictfile);
my %glossary;

findwords();
printlexicon(\%dictionary) if $cl_options{dictionary_output};
printlexicon(\%glossary) if $cl_options{glossary_output};

# readdict() reads in the dictionary file defined above using
# the Compress::Zlib CPAN module. It returns a hash that is
# used for all further dictionary operations.
#
sub readdict {
    my $dict = shift;
    my %dicthash;
    my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno\n";
    while ($gz->gzreadline($_) > 0) {
        chomp;
        $dicthash{lc($_)} = 0;
    }
    die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREAM_END;
    return %dicthash;
}

# findwords() reads in a file and compares words found in the file
# with the contents of the dictionary read in by the readdict
# function. It assigns counts to the elements of %dictionary and
# creates %glossary elements and increases its values according to
# the number of matches.
sub findwords {
    # "or" rather than "||": "||" binds to $file, so the die would never fire
    open my $if, "<", $file or die "Could not open $file: $!";
    while (<$if>) {
        chomp;
        my @elements = split(/[ '-]/,$_); # split on hyphens, too
        foreach my $element (@elements) {
            next if $element =~ /\d/; # Don't need digits
            print "[$element]->" if $cl_options{token_debug};
            $element = lc($element);
            $element =~ s/[\s,!?._;«»)("'-]//g;
            print "[$element]\n" if $cl_options{token_debug};
            next if $element eq '';
            if ( exists $dictionary{$element} ) {
                $dictionary{$element}++;
            }
            else {
                $glossary{$element}++;
            }
        }
    }
}

# printlexicon() reads in a lexicon hash via a reference and prints all words out
# that have been seen in the findwords() function along with a frequency count.
#
sub printlexicon {
    my $lexicon = shift;
    my $counter = 0;
    foreach my $key (sort keys %$lexicon) {
        if ( $$lexicon{$key} > 0 ) {
            print $key . " : " . $$lexicon{$key} . "\n";
            $counter++;
        }
    }
    print "\n$counter entries total\n";
}

__END__

=pod

=head1 dict-compare

A generic script for building dictionaries by comparing them to real-world texts.

=head1 DESCRIPTION

This program compares the words in a given text file to a list of words from a dictionary file. It is capable of outputting lists of words that occur or do not occur in a given dictionary file, along with their frequency in the text. Debugging output using token tag marks is also available.

=head1 SYNOPSIS

C<< dict-compare [--glossary --dictionary] [--token-debug] file > output_file >>

=head2 OPTIONS

=over 12

=item C<--help,-h,-?>

Prints a usage help screen.

=item C<--man,-m>

Prints out the manual entry for $0

=item C<--version,-v>

Prints out the program version.

=item C<--glossary>

Prints a glossary of words not found in the dictionary file and the number of times they occur.

=item C<--dictionary>

Prints out the words from the text that had a dictionary match, along with their respective frequencies.

=item C<--token-debug>

Prints tags around each token in the text to help sound out strange tokens. The tokens themselves are printed side-by-side to show how the script cleans up the results.

=back

=head1 EXAMPLE

C<dict-compare --glossary myfile.txt>

This command reads in the text contained in myfile.txt and prints out a list of words not found in the dictionary and their frequencies.

=head1 DICTIONARY FORMAT

The dictionary is a one-word-per-line file that has been gzipped. Your dictionary can be anything. Think of the possibilities.

=head1 THANKS

The following people have reviewed and offered improvements to this code:

=over 12

=item B<Sauoq> L<http://www.perlmonks.org/index.pl?node_id=182681>

=item B<adjelore> L<http://www.perlmonks.org/index.pl?node_id=131479>

=item B<Hutta> L<http://www.perlmonks.org/index.pl?node_id=117788>

=item B<TomDLux> L<http://www.perlmonks.org/index.pl?node_id=144696>

=item B<Not_A_Number> L<http://www.perlmonks.org/index.pl?node_id=258724>

=back

And of course all of the others at the Monastery, Cologne.pm whose help can only be seen in its cumulative effect.

=head1 AUTHOR

Damon "allolex" Davison - <allolex@sdf.freeshell.org>

=head1 LICENSE

This code is released under the same terms as Perl itself.

=cut

Re: Dict - Compare
by toolic (Bishop) on Mar 26, 2010 at 23:21 UTC

      Thanks for the reply. For the post I just copied and pasted directly from the link, but I did remove the "+"s before I ran the script, so it still isn't working correctly.

Re: Dict - Compare
by toolic (Bishop) on Mar 27, 2010 at 18:59 UTC
    It seems to work as advertised for me, but I tried it on Linux.
    $ cp /usr/share/dict/words ./dict
    $ gzip dict
    $ cat ./file.txt
    big dog bgfdsrt
    $ dict-compare -dictionary ./file.txt
    big : 1
    dog : 1
    $ dict-compare -glossary ./file.txt
    bgfdsrt : 1
    Where did you get your dict.gz file from? Start adding print statements throughout the code (see Basic debugging checklist).

      Thanks toolic. I just gzipped a text file in the one-word-per-line format stated in the POD at the bottom of the script:

      =head1 DICTIONARY FORMAT
      The dictionary is a one-word-per-line file that has been gzipped. Your dictionary can be anything. Think of the possibilities.

      It seems like dict-compare is only recognizing the last word in the dictionary (with the -dictionary option). Is there something else I should be doing concerning the dict.gz file?

        As I mentioned, the script works for me. Therefore, I suspect there is a problem with your dict.gz file. That is why I suggested using print in your code. It does not sound like you have attempted to debug your problem yet.
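A minimal sketch of that kind of print debugging, in case it helps. It reads the gzipped dictionary the same way readdict() does (gzreadline followed by chomp) and prints each entry in brackets with any non-printable byte shown as a \xNN escape. The names visible() and dump_dict() and the file name dump-dict.pl are made up for this example; they are not part of dict-compare. One thing such a dump could reveal, and this is only a guess about the cause, is a trailing \x0d (carriage return) on entries if the word list was saved with Windows line endings, since chomp would normally strip only the newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Return a copy of $text with every non-printable byte shown as \xNN,
# so that an otherwise invisible carriage return appears as \x0d.
# (visible() is a hypothetical helper, not part of dict-compare.)
sub visible {
    my ($text) = @_;
    (my $shown = $text) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $shown;
}

# Read a gzipped file line by line, as readdict() does, and return the
# chomped lines with non-printable bytes made visible.
# (dump_dict() is a hypothetical helper, not part of dict-compare.)
sub dump_dict {
    my ($path) = @_;
    my $gz = gzopen($path, "rb") or die "Cannot open $path: $gzerrno\n";
    my @entries;
    my $line;
    while ($gz->gzreadline($line) > 0) {
        chomp $line;
        push @entries, visible($line);
    }
    $gz->gzclose();
    return @entries;
}

# Only do anything when a file name is given on the command line.
if (@ARGV) {
    print "[$_]\n" for dump_dict($ARGV[0]);
}
```

Saved as, say, dump-dict.pl, it would be run as `perl dump-dict.pl dict.gz`. If every entry prints cleanly as a single bracketed word, the dictionary file is probably fine and the next place to look is the text file being compared.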

        Another suggestion is to create a trivial dict file yourself with just 3 words in it:

        cat
        big
        dog

        Then, run the exact commands that I showed. You should get the output I posted.
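If it is easier on XP than using the command-line tools, the three-word dictionary could also be written from Perl itself. This is only a sketch: write_dict() and the file name make-dict.pl are invented for the example, and it assumes Compress::Zlib is installed (the script already requires it). Because gzwrite is handed the bytes directly, each word is followed by a bare "\n" regardless of platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib;

# Write the given words to a gzipped file, one word per line, each line
# terminated by a bare "\n". Returns the number of words written.
# (write_dict() is a hypothetical helper, not part of dict-compare.)
sub write_dict {
    my ($path, @words) = @_;
    my $gz = gzopen($path, "wb")
        or die "Cannot open $path for writing: $gzerrno\n";
    for my $word (@words) {
        # gzwrite returns the number of bytes written, or 0 on failure
        $gz->gzwrite("$word\n")
            or die "Error writing to $path: $gzerrno\n";
    }
    $gz->gzclose();
    return scalar @words;
}

# Only write a file when a target name is given on the command line,
# so that nothing is overwritten by accident.
if (@ARGV) {
    write_dict($ARGV[0], qw(cat big dog));
}
```

Running `perl make-dict.pl dict.gz` in the directory where dict-compare is run would replace any existing dict.gz there with the three-word version, so keep a copy of the original first.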