Ok, here is my script after some corrections:

#!/usr/bin/perl
#This script extracts words from corpus data that are not found in the
#dictionary file. For this, we use binary search. Linear search would
#take too long and use too many resources.

use strict;
use warnings;

#Use encode because of special characters.
use encoding "utf8";
use open IN => "utf8";
use open OUT => "utf8";
binmode STDIN => "utf8";
binmode STDOUT => "utf8";

my $wordlist = shift;
my @allwords; #array containing all dictionary words.

#First, I open the dictionary file. I then push all words into the
#allwords array.

open WORDLIST, '<', $wordlist or die "Cannot open $wordlist: $!";
while (<WORDLIST>){
  chomp;
  s/\r//;

  my $word = $_;
  push (@allwords,$word)
}
close WORDLIST;

#I then sort the array in alphabetic order.

my @sorted_wordlist = sort {$a cmp $b} @allwords;

#I create a subroutine to use binary search.

sub binary_search {

  #Set arguments for future use: $array will be the sorted wordlist
  #and $target the word we will be looking for.
  my ($array, $target) = @_;

  #Declare high and low indexes. Low index = 0 and high index = last
  #index of the array.
  my ($low, $high) = (0, @$array - 1);

  #If high index is higher than the low index, keep the window open.
  while ($low < $high) {
    #Declare a middle, which is the total of high index and low index / 2.
    my $cur = int(($low+$high)/2);
    if ($array->[$cur] lt $target) {
      $low = $cur + 1; #The middle word sorts before the target, so search the upper half.
    } elsif ($array->[$cur] gt $target) {
      $high = $cur - 1; #The middle word sorts after the target, so search the lower half.
    } else{
      return $cur; #Got it!
    }
  }
  return; #It doesn't exist.
}

# Open the corpus data.

while (<>){
    chomp;
    s/\r//;
    my $corpus_word = $_; #Declare the read line as a corpus word.

    #Use the binary search to find the index.
    my $index = binary_search (\@sorted_wordlist, $corpus_word);

    #If index is not returned, then the word doesn't exist.
    if($index == 0){
      print "$corpus_word\n";
    } else{
    }
  }

It did give me some results, but:
1) I'm getting words that ARE in the dictionary file.
2) It gives me the warning "Use of uninitialized value in numeric eq (==)".
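
The warning in 2) can be reproduced in isolation. This is only a sketch: `lookup` is a hypothetical stand-in that, like `binary_search` above, ends with a bare `return;` on a miss, and the `__WARN__` handler is there just to capture the message so it can be printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for binary_search: a bare "return;" yields
# undef in scalar context, like the "It doesn't exist" branch.
sub lookup {
    return;
}

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $index = lookup();

# Same test as in the script: comparing undef with == triggers the warning.
my $is_zero = ($index == 0) ? 1 : 0;

print "index is ", (defined $index ? $index : "undef"), "\n";
print "index == 0 is ", ($is_zero ? "true" : "false"), "\n";
print "warning: $_" for @warnings;
```

The numeric comparison treats `undef` as 0, which is what the warning is pointing at.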

Looking at the O'Reilly book "Mastering Algorithms with Perl", it turned out that I had to use $high = $cur - 1, not + 1, to adjust $high.
I don't know why I'm still getting words that exist in the dictionary file. I understand why I get the warning, but I don't know how to make the code cleaner. Any ideas?
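
For reference, here is a sketch of one way both symptoms could be addressed. It is not necessarily the only fix. It assumes the cause is twofold: the loop condition `$low < $high` stops before the last remaining element is compared, and the `== 0` test treats a hit at index 0 the same as a miss. The sample word list is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the index of $target in the sorted array referenced by $array,
# or undef (in scalar context) if the word is absent.
sub binary_search {
    my ($array, $target) = @_;
    my ($low, $high) = (0, $#$array);

    # "<=" so the last remaining candidate ($low == $high) is still compared.
    while ($low <= $high) {
        my $cur = int(($low + $high) / 2);
        if ($array->[$cur] lt $target) {
            $low = $cur + 1;     # middle sorts before target: search upper half
        } elsif ($array->[$cur] gt $target) {
            $high = $cur - 1;    # middle sorts after target: search lower half
        } else {
            return $cur;
        }
    }
    return;
}

# Made-up sample data, sorted the same way as in the script.
my @sorted_wordlist = sort { $a cmp $b } qw(tree apple zebra cat house);

my @missing;
for my $corpus_word (qw(apple zebra dog house bird)) {
    my $index = binary_search(\@sorted_wordlist, $corpus_word);
    # defined() separates "not found" (undef) from "found at index 0".
    push @missing, $corpus_word unless defined $index;
}

print "$_\n" for @missing;
```

With this sample data the loop prints only `dog` and `bird`. `apple`, which sits at index 0, and `zebra`, which sits at the last index, are both found, and no uninitialized-value warning is raised because the result is never compared numerically.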

In reply to Re^6: Check word presence WITHOUT hashes or grep by gojippo
in thread Check word presence WITHOUT hashes or grep by gojippo
