in reply to Check word presence WITHOUT hashes or grep

Since you are supposed to be doing this for your learning, I don't think anyone will just hand you the code you want, but here's a description of a binary search:
  1. Keep a high and a low index into the array that demarcate the possible range that may contain your sought word. Initially, the low index is 0 and the high index is the last index of the array, or -1 if the array is empty.
  2. See if the high index is at least as great as the low index (i.e. the possible range of the array that might contain the sought word is not empty). If not, the sought word is not in the array and you are done.
  3. Find the midpoint between the high and low indexes, rounding either up or down.
  4. If the word at the midpoint index is greater than (using a string comparison) the sought word, the sought word must occur earlier in the array, if at all. Set the high index to one less than the midpoint index, and continue at step 2.
  5. Otherwise, if the word at the midpoint index is less than (using a string comparison) the sought word, the sought word must occur later in the array, if at all. Set the low index to one more than the midpoint index, and continue at step 2.
  6. Otherwise, the sought word is at the midpoint index: you've found it and are done.
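
For anyone who wants to check their own attempt against the steps above, here is a minimal sketch in Perl. The name `binary_search` and the convention of returning -1 for "not found" are my assumptions, not part of the description; it also assumes the array was sorted with `cmp`.

```perl
use strict;
use warnings;

# Sketch of steps 1-6. Returns the index of $target in the sorted array
# referenced by $words, or -1 if absent (the -1 convention is an assumption).
sub binary_search {
    my ($words, $target) = @_;

    # Step 1: closed range [low, high]; $#$words is -1 for an empty array.
    my ($low, $high) = (0, $#$words);

    # Step 2: keep going while the range is not empty.
    while ($low <= $high) {
        # Step 3: midpoint, rounded down. Note that int() wraps the division.
        my $mid = int(($low + $high) / 2);

        my $order = $words->[$mid] cmp $target;
        if ($order > 0) {
            $high = $mid - 1;    # Step 4: midpoint word is greater, look earlier.
        }
        elsif ($order < 0) {
            $low = $mid + 1;     # Step 5: midpoint word is less, look later.
        }
        else {
            return $mid;         # Step 6: found it.
        }
    }
    return -1;                   # Range became empty: not in the array.
}

my @words = sort qw(pear apple fig cherry banana);
print binary_search(\@words, 'cherry'), "\n";    # prints 2
print binary_search(\@words, 'grape'),  "\n";    # prints -1
```
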
A common variation is to have the "high" index be one more than the highest possible index.
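
A sketch of that half-open variation, under the same assumptions (the name `binary_search_half_open` and the -1 return value are mine). The differences are the initial high index, the loop test, and the update in the "look earlier" branch:

```perl
use strict;
use warnings;

# Half-open range [low, high): $high is one past the highest candidate index.
sub binary_search_half_open {
    my ($words, $target) = @_;
    my ($low, $high) = (0, scalar @$words);

    while ($low < $high) {                    # empty when low == high
        my $mid = int(($low + $high) / 2);    # always less than $high here
        my $order = $words->[$mid] cmp $target;
        if ($order > 0) {
            $high = $mid;                     # exclude $mid by making it the new end
        }
        elsif ($order < 0) {
            $low = $mid + 1;
        }
        else {
            return $mid;
        }
    }
    return -1;
}

my @words = sort qw(pear apple fig cherry banana);
print binary_search_half_open(\@words, 'pear'), "\n";    # prints 4
```

With this form the midpoint has to be rounded down; rounding up could pick an index equal to `$high`, which is outside the range.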

Test well; it's very common for people writing binary search code to have off-by-one errors that result in false negatives or endless loops.
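
One possible way to test for those off-by-one cases is to run the search over every small array size and every target position, including targets that fall before, between, and after the stored words. The harness below is only a sketch: `count_failures` is a hypothetical helper, and the `binary_search` included here is a stand-in for whatever function you are testing (assumed to return an index, or -1 when the word is absent).

```perl
use strict;
use warnings;

# Stand-in for the function under test (closed-range version).
sub binary_search {
    my ($words, $target) = @_;
    my ($low, $high) = (0, $#$words);
    while ($low <= $high) {
        my $mid = int(($low + $high) / 2);
        my $order = $words->[$mid] cmp $target;
        if    ($order > 0) { $high = $mid - 1 }
        elsif ($order < 0) { $low  = $mid + 1 }
        else               { return $mid }
    }
    return -1;
}

# Try arrays of size 0 .. $max_size. The words are "b", "d", "f", ...,
# so the targets "a", "c", "e", ... are absent and land in every gap.
sub count_failures {
    my ($search, $max_size) = @_;
    my $failures = 0;
    for my $size (0 .. $max_size) {
        my @words = map { chr(ord('b') + 2 * $_) } 0 .. $size - 1;
        for my $i (0 .. 2 * $size) {
            my $target   = chr(ord('a') + $i);
            my $expected = $i % 2 ? ($i - 1) / 2 : -1;   # odd $i: present
            my $got      = $search->(\@words, $target);
            $failures++ if $got != $expected;
        }
    }
    return $failures;
}

print count_failures(\&binary_search, 10), "\n";    # prints 0
```

A false negative shows up as a nonzero count; an endless loop shows up as the script never finishing.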

Re^2: Check word presence WITHOUT hashes or grep
by gojippo (Novice) on Apr 30, 2008 at 08:37 UTC
    Ok ysth, thank you for your help. Now I think I get how it works, I'll give it a try and tell you guys how it went.
    Thank you all for your patience.
        Hello holy monks. I gave it a try, but the following script I wrote just prints out every word, even when the word does exist in the dictionary file. Could you point out where I went wrong? I'd really appreciate it.


        #!/usr/bin/perl

        #This script is used to extract words not found in the dictionary file
        #from corpus data. For this, we use binary search. Linear source would
        #take too long and use too much resources.

        use strict;
        use warnings;

        #Use encode because of special characters.
        use encoding "utf8";
        use open IN => "utf8";
        use open OUT => "utf8";
        binmode STDIN => "utf8";
        binmode STDOUT => "utf8";

        my $wordlist = shift;
        my @allwords; #array containing all dictionary words.

        #First, I open the dictionary file. I then push all words into the allwords array.
        open WORDLIST, $wordlist;
        while (<WORDLIST>){
            chomp;
            s/\r//;
            my $word = $_;
            push (@allwords,$word)
        }
        close WORDLIST;

        #I then sort the array in alphabetic order.
        my @sorted_wordlist = sort {$a cmp $b} @allwords;

        #I create a subroutine to use binary search.
        sub binary_search {
            my ($array, $target) = @_;
            #set arguments for future use : $array will be the sorted wordlist
            #and $target, the word we will be looking for.

            my ($low, $high) = (0, @$array - 1);
            #Declare high and low indexes. Low index = 0 and high index = last index of the array.

            while ($low < $high) { # If high index is higher than the low index, keep the window open.
                my $cur = int($low+$high)/2; #Declare a middle, which is the total of high index and low index /2.
                if ($array->[$cur] lt $target) {
                    $low = $cur + 1; #If the target is too small, try lower.
                }
                else {
                    $high = $cur; #Else, try higher.
                }
            }
        }

        # Open the corpus data.
        while (<>){
            chomp;
            s/\r//;
            my $corpus_word = $_; #Declare the read line as a corpus word.
            my $index = binary_search (\@sorted_wordlist, $corpus_word); #use the binary search to find the index
            if($index < @sorted_wordlist && $sorted_wordlist[$index] eq $corpus_word){
                #If found, do nothing.
            }
            else{
                print "$corpus_word\n"; #If not, print.
            }
        }