in reply to Dict - Compare

It seems to work as advertised for me, but I tried it on Linux.
$ cp /usr/share/dict/words ./dict
$ gzip dict
$ cat ./file.txt
big dog bgfdsrt
$ dict-compare -dictionary ./file.txt
big : 1
dog : 1
$ dict-compare -glossary ./file.txt
bgfdsrt : 1
Where did you get your dict.gz file from? Start adding print statements throughout the code (see Basic debugging checklist).

Re^2: Dict - Compare
by drno (Initiate) on Mar 29, 2010 at 16:41 UTC

    Thanks toolic. I just gzipped a text file in the one-word-per-line format stated in the comments at the bottom of the script:

    =head1 DICTIONARY FORMAT

    The dictionary is a one-word-per-line file that has been gzipped. Your dictionary can be anything. Think of the possibilities.

    It seems like dict-compare is only recognizing the last word in the dictionary (with the -dictionary option). Is there something else I should be doing concerning the dict.gz file?

      As I mentioned, the script works for me. Therefore, I suspect there is a problem with your dict.gz file. That is why I suggested using print in your code. It does not sound like you have attempted to debug your problem yet.
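
      If print statements inside the script are inconclusive, the dictionary bytes can also be checked from a shell. The following is only a sketch, assuming a POSIX shell with gzip, grep and od available. The sample file is hypothetical: it is built with CRLF line endings and no terminator after the last word, which is one possible way (an assumption, not something confirmed in this thread) for a word list to end up with only its last entry matching exactly.

```shell
# Hypothetical sample: CRLF line endings, no terminator after the last word.
# This is an assumption for illustration, not the poster's actual file.
workdir=$(mktemp -d)
printf 'cat\r\nbig\r\ndog' | gzip > "$workdir/sample.gz"

# Exact whole-line matches, comparable to an exact hash-key lookup.
# "|| true" keeps grep's non-zero exit (no match) from aborting the script.
big_hits=$(gzip -dc "$workdir/sample.gz" | grep -c -x 'big' || true)
dog_hits=$(gzip -dc "$workdir/sample.gz" | grep -c -x 'dog' || true)
echo "big: $big_hits  dog: $dog_hits"

# Byte dump: look for \r immediately before each \n.
gzip -dc "$workdir/sample.gz" | od -c
```

      To inspect a real dictionary, run the last command against dict.gz instead of the sample. If od shows \r before \n, the words carry a trailing carriage return that chomp alone does not remove.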

      Another suggestion is to create a trivial dict file yourself with just 3 words in it:

      cat
      big
      dog

      Then, run the exact commands that I showed. You should get the output I posted.
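
      A minimal sketch of building such a three-word dictionary from a shell, assuming a POSIX shell with gzip installed. The scratch directory and the workdir variable are only illustrative; the file names dict and dict.gz follow the commands shown earlier in this thread.

```shell
# Scratch directory so no existing dict or dict.gz is overwritten.
workdir=$(mktemp -d)

# Three words, one per line, each ending in a single newline.
printf 'cat\nbig\ndog\n' > "$workdir/dict"
gzip "$workdir/dict"            # leaves $workdir/dict.gz

# Confirm what the script will read back.
gzip -dc "$workdir/dict.gz"
```

      The script opens 'dict.gz' by a relative path, so the resulting file has to be copied into the directory from which dict-compare is run.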

        I used the three word dictionary as you suggested. Dict-compare is only recognizing the last entry in the dictionary.

        dict.gz = cat big dog

        file.txt = fish bird big lizard mammal

        "big" is recognized with -dictionary only when it is listed last.

        I've used print statements in the dictionary (sub readdict) to confirm that it is being read correctly as well as in the glossary (sub findwords).

        I'm pasting dict.gz and file.txt into my perl directory. Dict.gz is just a one-word-per-line text file that is gzipped in gzip's directory and then pasted into the perl directory.

        Thanks again. If you have a tip on where I should try to debug I'll definitely try it.

        #!/usr/bin/perl
        # POD can be found at the bottom of this script
        use strict;
        use warnings;
        use Compress::Zlib;
        use Getopt::Long;
        use Pod::Usage;

        my $VERSION = 0.81;
        my $dictfile = 'dict.gz';

        # Process command-line options
        my %cl_options = (
            help              => '',
            version           => '',
            token_debug       => '',
            glossary_output   => '',
            dictionary_output => ''
        );

        GetOptions(
            'help|?'      => \$cl_options{help},
            'version'     => \$cl_options{version},
            'man'         => \$cl_options{man},
            'token-debug' => \$cl_options{token_debug},
            'glossary'    => \$cl_options{glossary_output},
            'dictionary'  => \$cl_options{dictionary_output}
        );

        print "This is version $VERSION of $0.\n" if $cl_options{version};
        exit(0) if ($cl_options{version});

        pod2usage(-exitstatus => 0, -verbose => 1, -msg => "Help for $0") if $cl_options{help};
        pod2usage(-exitstatus => 0, -verbose => 2, -msg => "Man page for $0") if $cl_options{man};

        my $file = shift;
        my %dictionary = readdict(\$dictfile);
        my %glossary;

        findwords();

        printlexicon(\%dictionary) if $cl_options{dictionary_output};
        printlexicon(\%glossary) if $cl_options{glossary_output};

        # readdict reads in the dictionary file defined above using
        # the Compress::Zlib CPAN module. It returns a hash that is
        # used for all further dictionary operations.
        #
        sub readdict {
            my $dict = shift;
            my %dicthash;
            my $gz = gzopen($$dict, "rb") or die "Cannot open $$dict: $gzerrno\n";
            while ($gz->gzreadline($_) > 0) {
                chomp;
                $dicthash{lc($_)} = 0;
                print "Dictionary $_\n";
            }
            die "Error reading from $$dict: $gzerrno\n" if $gzerrno != Z_STREAM_END;
            return %dicthash;
        }

        # findwords() reads in a file and compares words found in the file
        # with the contents of the dictionary read in by the readdict
        # function. It assigns counts to the elements of %dictionary and
        # creates %glossary elements and increases its values according to
        # the number of matches.
        sub findwords {
            # Low-precedence "or" so that a failed open actually reaches die
            open my $if, "<", $file or die "Could not open $file: $!";
            while (<$if>) {
                chomp;
                my @elements = split(/[ '-]/,$_);    # split on hyphens, too
                foreach my $element (@elements) {
                    next if $element =~ /\d/;        # Don't need digits
                    print "[$element]->" if $cl_options{token_debug};
                    $element = lc($element);
                    $element =~ s/[\s,!?._;«»)("'-]//g;
                    print "[$element]\n" if $cl_options{token_debug};
                    next if $element eq '';
                    if ( exists $dictionary{$element} ) {
                        $dictionary{$element}++;
                    } else {
                        $glossary{$element}++;
                        print "Text: @elements\n";
                    }
                }
            }
        }

        # printlexicon reads in a lexicon hash via a reference and prints all words out
        # that have been seen in the findwords() function along with a frequency count.
        #
        sub printlexicon {
            my $lexicon = shift;
            my $counter = 0;
            foreach my $key (sort keys %$lexicon) {
                if ( $$lexicon{$key} > 0 ) {
                    print $key . " : " . $$lexicon{$key} . "\n";
                    $counter++;
                }
            }
            print "\n$counter entries total\n";
        }
        __END__