Ok, here is my script after some corrections:
#!/usr/bin/perl
#This script is used to extract words not found in the dictionary file
+ from corpus data. For this, we use binary search. Linear source woul
+d take too long and use too much resources.
use strict;
use warnings;
#Use encode because of special characters.
use encoding "utf8";
use open IN => "utf8";
use open OUT => "utf8";
binmode STDIN => "utf8";
binmode STDOUT => "utf8";
my $wordlist = shift;
my @allwords; #array containing all dictionary words.
#First, I open the dictionary file. I then push all words into the all
+words array.
open WORDLIST, $wordlist;
while (<WORDLIST>){
chomp;
s/\r//;
my $word = $_;
push (@allwords,$word)
}
close WORDLIST;
#I then sort the array in alphabetic order.
my @sorted_wordlist = sort {$a cmp $b} @allwords;
#I create a subroutine to use binary search.
sub binary_search {
my ($array, $target) = @_;
#set arguments for future use : $array will be the sorted wordlist a
+nd $target, the word we will be looking for.
my ($low, $high) = (0, @$array - 1);
#Declare high and low indexes. Low index = 0 and high index = last i
+ndex of the array.
while ($low < $high) { # If high index is higher than the low index,
+ keep the window open.
my $cur = int(($low+$high)/2); #Declare a middle, which is the tot
+al of high index and low index /2.
if ($array->[$cur] lt $target) {
$low = $cur + 1; #If the target is too small, try lower.
} elsif ($array->[$cur] gt $target) {
$high = $cur - 1; #Else, try higher.
} else{
return $cur; #Got it!
}
}
return; #It doesn't exist.
}
# Open the corpus data.
while (<>){
chomp;
s/\r//;
my $corpus_word = $_; #Declare the read line as a corpus word.
my $index = binary_search (\@sorted_wordlist, $corpus_word); #use
+the binary search to find the index
if($index == 0){ #if index is not returned, then the word doesn't
+exist.
print "$corpus_word\n";
} else{
}
}
It did give me some results, but :
1)I'm getting words that ARE in the dictionnary file.
2)it gives me the use of unitialized value in numeric eq (==).
Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
I don't know why I'm still getting words that exist in the dictionnary file, but I understand why I get the error message, but don't know how to make it cleaner. Any ideas ? | [reply] [d/l] |
You're getting there. Now you are returning the index if the word was found and undef if it wasn't, and you get a warning when you compare undef == 0. You could change your if statement to if ( ! defined $index ). Your words not being found may be because you should be doing while ($low <= $high). The way you have it, you are quitting when you've narrowed down the range to one element instead of checking that element.
Looking at the "Mastering algorithms with perl" O'Reilly Book, it turned out that I had to do $high = $cur - 1 and not +1, to adjust $high.
Sorry, my mistake.
| [reply] [d/l] [select] |
I corrected my script with your propositions, and now it works just fine. So thank you, ysth, and thank you all perl monks.
I learned a lot thanks to you and finally got a script up and working.
| [reply] |