in reply to Re^4: Check word presence WITHOUT hashes or grep
in thread Check word presence WITHOUT hashes or grep

You aren't explicitly returning anything from binary_search(), so the return is the value of the last expression executed, which happens to be false. So your $index is always numerically 0, and the word is printed unless it was the first word in the array. Having a binary search return the last index examined can make sense when you may want to insert a new element if it wasn't found, but if you just want to know whether the target was found or not, a boolean return would make more sense.

Your binary_search has a few problems. You need to do the int() after dividing by 2, or your limits will be non-integers and you'll see odd errors. You have an if/else in your while loop, but there are three possible cases (eq, lt, gt) you need to take into consideration. You need to set $high to $cur + 1 - 1, not $cur, or you may have an endless loop.

  • Comment on Re^5: Check word presence WITHOUT hashes or grep

Replies are listed 'Best First'.
Re^6: Check word presence WITHOUT hashes or grep
by gojippo (Novice) on May 01, 2008 at 05:08 UTC
    Ok, here is my script after some corrections:

    #!/usr/bin/perl #This script is used to extract words not found in the dictionary file + from corpus data. For this, we use binary search. Linear source woul +d take too long and use too much resources. use strict; use warnings; #Use encode because of special characters. use encoding "utf8"; use open IN => "utf8"; use open OUT => "utf8"; binmode STDIN => "utf8"; binmode STDOUT => "utf8"; my $wordlist = shift; my @allwords; #array containing all dictionary words. #First, I open the dictionary file. I then push all words into the all +words array. open WORDLIST, $wordlist; while (<WORDLIST>){ chomp; s/\r//; my $word = $_; push (@allwords,$word) } close WORDLIST; #I then sort the array in alphabetic order. my @sorted_wordlist = sort {$a cmp $b} @allwords; #I create a subroutine to use binary search. sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist a +nd $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last i +ndex of the array. while ($low < $high) { # If high index is higher than the low index, + keep the window open. my $cur = int(($low+$high)/2); #Declare a middle, which is the tot +al of high index and low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower. } elsif ($array->[$cur] gt $target) { $high = $cur - 1; #Else, try higher. } else{ return $cur; #Got it! } } return; #It doesn't exist. } # Open the corpus data. while (<>){ chomp; s/\r//; my $corpus_word = $_; #Declare the read line as a corpus word. my $index = binary_search (\@sorted_wordlist, $corpus_word); #use +the binary search to find the index if($index == 0){ #if index is not returned, then the word doesn't +exist. print "$corpus_word\n"; } else{ } }

    It did give me some results, but :
    1)I'm getting words that ARE in the dictionnary file.
    2)it gives me the use of unitialized value in numeric eq (==).

    Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
    I don't know why I'm still getting words that exist in the dictionnary file, but I understand why I get the error message, but don't know how to make it cleaner. Any ideas ?
      You're getting there. Now you are returning the index if the word was found and undef if it wasn't, and you get a warning when you compare undef == 0. You could change your if statement to if ( ! defined $index ). Your words not being found may be because you should be doing while ($low <= $high). The way you have it, you are quitting when you've narrowed down the range to one element instead of checking that element.
      Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
      Sorry, my mistake.
        I corrected my script with your propositions, and now it works just fine. So thank you, ysth, and thank you all perl monks.
        I learned a lot thanks to you and finally got a script up and working.