in reply to Re^2: Check word presence WITHOUT hashes or grep
in thread Check word presence WITHOUT hashes or grep

Let us know if there's some particular part you get stuck on.
  • Comment on Re^3: Check word presence WITHOUT hashes or grep

Replies are listed 'Best First'.
Re^4: Check word presence WITHOUT hashes or grep
by gojippo (Novice) on May 01, 2008 at 02:56 UTC
    Hello holy monks. I gave it a try, but using the following script I made just prints out every word, even if it does exist in the dictionary file. Could you point me to where I'm wrong ? I'd really appreciate it.


    #!/usr/bin/perl #This script is used to extract words not found in the dictionary file + from corpus data. For this, we use binary search. Linear source woul +d take too long and use too much resources. use strict; use warnings; #Use encode because of special characters. use encoding "utf8"; use open IN => "utf8"; use open OUT => "utf8"; binmode STDIN => "utf8"; binmode STDOUT => "utf8"; my $wordlist = shift; my @allwords; #array containing all dictionary words. #First, I open the dictionary file. I then push all words into the all +words array. open WORDLIST, $wordlist; while (<WORDLIST>){ chomp; s/\r//; my $word = $_; push (@allwords,$word) } close WORDLIST; #I then sort the array in alphabetic order. my @sorted_wordlist = sort {$a cmp $b} @allwords; #I create a subroutine to use binary search. sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist a +nd $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last i +ndex of the array. while ($low < $high) { # If high index is higher than the low index, + keep the window open. my $cur = int($low+$high)/2; #Declare a middle, which is the total + of high index and low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower. } else { $high = $cur; #Else, try higher. } } } # Open the corpus data. while (<>){ chomp; s/\r//; my $corpus_word = $_; #Declare the read line as a corpus word. my $index = binary_search (\@sorted_wordlist, $corpus_word); #use +the binary search to find the index if($index < @sorted_wordlist && $sorted_wordlist[$index] eq $corpu +s_word){ #If found, do nothing. } else{ print "$corpus_word\n"; #If not, print. } }
      You aren't explicitly returning anything from binary_search(), so the return is the value of the last expression executed, which happens to be false. So your $index is always numerically 0, and the word is printed unless it was the first word in the array. Having a binary search return the last index examined can make sense when you may want to insert a new element if it wasn't found, but if you just want to know whether the target was found or not, a boolean return would make more sense.

      Your binary_search has a few problems. You need to do the int() after dividing by 2, or your limits will be non-integers and you'll see odd errors. You have an if/else in your while loop, but there are three possible cases (eq, lt, gt) you need to take into consideration. You need to set $high to $cur + 1 - 1, not $cur, or you may have an endless loop.

        Ok, here is my script after some corrections:

        #!/usr/bin/perl #This script is used to extract words not found in the dictionary file + from corpus data. For this, we use binary search. Linear source woul +d take too long and use too much resources. use strict; use warnings; #Use encode because of special characters. use encoding "utf8"; use open IN => "utf8"; use open OUT => "utf8"; binmode STDIN => "utf8"; binmode STDOUT => "utf8"; my $wordlist = shift; my @allwords; #array containing all dictionary words. #First, I open the dictionary file. I then push all words into the all +words array. open WORDLIST, $wordlist; while (<WORDLIST>){ chomp; s/\r//; my $word = $_; push (@allwords,$word) } close WORDLIST; #I then sort the array in alphabetic order. my @sorted_wordlist = sort {$a cmp $b} @allwords; #I create a subroutine to use binary search. sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist a +nd $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last i +ndex of the array. while ($low < $high) { # If high index is higher than the low index, + keep the window open. my $cur = int(($low+$high)/2); #Declare a middle, which is the tot +al of high index and low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower. } elsif ($array->[$cur] gt $target) { $high = $cur - 1; #Else, try higher. } else{ return $cur; #Got it! } } return; #It doesn't exist. } # Open the corpus data. while (<>){ chomp; s/\r//; my $corpus_word = $_; #Declare the read line as a corpus word. my $index = binary_search (\@sorted_wordlist, $corpus_word); #use +the binary search to find the index if($index == 0){ #if index is not returned, then the word doesn't +exist. print "$corpus_word\n"; } else{ } }

        It did give me some results, but :
        1)I'm getting words that ARE in the dictionnary file.
        2)it gives me the use of unitialized value in numeric eq (==).

        Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
        I don't know why I'm still getting words that exist in the dictionnary file, but I understand why I get the error message, but don't know how to make it cleaner. Any ideas ?

      Try stripping your code down to a sample that tests the binary search:

      use strict; use warnings; my @allwords = sort qw(array will be the sorted wordlist and target); sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist + and $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last + index of the array. while ($low < $high) { # If high index is higher than the low index, keep the window + open. my $cur = int ($low + $high) / 2 ; #Declare a middle, which is the total of high index and +low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower +. } else { $high = $cur; #Else, try higher. } } } for my $word (qw (search wordlist)) { my $index = binary_search (\@allwords, $word); next if $index >= @allwords; if ($allwords[$index] eq $word) { print "Found $word at $index\n"; } else { print "Couldn't find $word\n"; } }

      Prints:

      Couldn't find search Couldn't find wordlist

      which seems to indicate that your search is failing.

      For any non-trivial project you would take the test code and wrap it up in something like Test::More as a unit test so that you have a convenient way to test all the edge cases (zero, one or two words in the list for example) whenever you alter the code.


      Perl is environmentally friendly - it saves trees