in reply to Re^3: Check word presence WITHOUT hashes or grep
in thread Check word presence WITHOUT hashes or grep

Hello holy monks. I gave it a try, but using the following script I made just prints out every word, even if it does exist in the dictionary file. Could you point me to where I'm wrong ? I'd really appreciate it.


#!/usr/bin/perl #This script is used to extract words not found in the dictionary file + from corpus data. For this, we use binary search. Linear source woul +d take too long and use too much resources. use strict; use warnings; #Use encode because of special characters. use encoding "utf8"; use open IN => "utf8"; use open OUT => "utf8"; binmode STDIN => "utf8"; binmode STDOUT => "utf8"; my $wordlist = shift; my @allwords; #array containing all dictionary words. #First, I open the dictionary file. I then push all words into the all +words array. open WORDLIST, $wordlist; while (<WORDLIST>){ chomp; s/\r//; my $word = $_; push (@allwords,$word) } close WORDLIST; #I then sort the array in alphabetic order. my @sorted_wordlist = sort {$a cmp $b} @allwords; #I create a subroutine to use binary search. sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist a +nd $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last i +ndex of the array. while ($low < $high) { # If high index is higher than the low index, + keep the window open. my $cur = int($low+$high)/2; #Declare a middle, which is the total + of high index and low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower. } else { $high = $cur; #Else, try higher. } } } # Open the corpus data. while (<>){ chomp; s/\r//; my $corpus_word = $_; #Declare the read line as a corpus word. my $index = binary_search (\@sorted_wordlist, $corpus_word); #use +the binary search to find the index if($index < @sorted_wordlist && $sorted_wordlist[$index] eq $corpu +s_word){ #If found, do nothing. } else{ print "$corpus_word\n"; #If not, print. } }

Replies are listed 'Best First'.
Re^5: Check word presence WITHOUT hashes or grep
by ysth (Canon) on May 01, 2008 at 03:38 UTC
    You aren't explicitly returning anything from binary_search(), so the return is the value of the last expression executed, which happens to be false. So your $index is always numerically 0, and the word is printed unless it was the first word in the array. Having a binary search return the last index examined can make sense when you may want to insert a new element if it wasn't found, but if you just want to know whether the target was found or not, a boolean return would make more sense.

    Your binary_search has a few problems. You need to do the int() after dividing by 2, or your limits will be non-integers and you'll see odd errors. You have an if/else in your while loop, but there are three possible cases (eq, lt, gt) you need to take into consideration. You need to set $high to $cur + 1 - 1, not $cur, or you may have an endless loop.

      Ok, here is my script after some corrections:

      #!/usr/bin/perl #This script is used to extract words not found in the dictionary file + from corpus data. For this, we use binary search. Linear source woul +d take too long and use too much resources. use strict; use warnings; #Use encode because of special characters. use encoding "utf8"; use open IN => "utf8"; use open OUT => "utf8"; binmode STDIN => "utf8"; binmode STDOUT => "utf8"; my $wordlist = shift; my @allwords; #array containing all dictionary words. #First, I open the dictionary file. I then push all words into the all +words array. open WORDLIST, $wordlist; while (<WORDLIST>){ chomp; s/\r//; my $word = $_; push (@allwords,$word) } close WORDLIST; #I then sort the array in alphabetic order. my @sorted_wordlist = sort {$a cmp $b} @allwords; #I create a subroutine to use binary search. sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist a +nd $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last i +ndex of the array. while ($low < $high) { # If high index is higher than the low index, + keep the window open. my $cur = int(($low+$high)/2); #Declare a middle, which is the tot +al of high index and low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower. } elsif ($array->[$cur] gt $target) { $high = $cur - 1; #Else, try higher. } else{ return $cur; #Got it! } } return; #It doesn't exist. } # Open the corpus data. while (<>){ chomp; s/\r//; my $corpus_word = $_; #Declare the read line as a corpus word. my $index = binary_search (\@sorted_wordlist, $corpus_word); #use +the binary search to find the index if($index == 0){ #if index is not returned, then the word doesn't +exist. print "$corpus_word\n"; } else{ } }

      It did give me some results, but :
      1)I'm getting words that ARE in the dictionnary file.
      2)it gives me the use of unitialized value in numeric eq (==).

      Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
      I don't know why I'm still getting words that exist in the dictionnary file, but I understand why I get the error message, but don't know how to make it cleaner. Any ideas ?
        You're getting there. Now you are returning the index if the word was found and undef if it wasn't, and you get a warning when you compare undef == 0. You could change your if statement to if ( ! defined $index ). Your words not being found may be because you should be doing while ($low <= $high). The way you have it, you are quitting when you've narrowed down the range to one element instead of checking that element.
        Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
        Sorry, my mistake.
Re^5: Check word presence WITHOUT hashes or grep
by GrandFather (Saint) on May 01, 2008 at 04:13 UTC

    Try stripping your code down to a sample that tests the binary search:

    use strict; use warnings; my @allwords = sort qw(array will be the sorted wordlist and target); sub binary_search { my ($array, $target) = @_; #set arguments for future use : $array will be the sorted wordlist + and $target, the word we will be looking for. my ($low, $high) = (0, @$array - 1); #Declare high and low indexes. Low index = 0 and high index = last + index of the array. while ($low < $high) { # If high index is higher than the low index, keep the window + open. my $cur = int ($low + $high) / 2 ; #Declare a middle, which is the total of high index and +low index /2. if ($array->[$cur] lt $target) { $low = $cur + 1; #If the target is too small, try lower +. } else { $high = $cur; #Else, try higher. } } } for my $word (qw (search wordlist)) { my $index = binary_search (\@allwords, $word); next if $index >= @allwords; if ($allwords[$index] eq $word) { print "Found $word at $index\n"; } else { print "Couldn't find $word\n"; } }

    Prints:

    Couldn't find search Couldn't find wordlist

    which seems to indicate that your search is failing.

    For any non-trivial project you would take the test code and wrap it up in something like Test::More as a unit test so that you have a convenient way to test all the edge cases (zero, one or two words in the list for example) whenever you alter the code.


    Perl is environmentally friendly - it saves trees