Re: Check word presence WITHOUT hashes or grep
by bobf (Monsignor) on Apr 30, 2008 at 05:06 UTC
Kudos for mentioning up front that this is a learning exercise. That makes the somewhat odd requirements less of a distraction. It also means that my suggestion will apply to your current approach rather than trying to change it.
Since you've got the dictionary words alphabetized in an array, you could use a binary search to determine if a word in your corpus is in the array. You should be able to find some things around here on that pretty easily if you use Super Search and sprinkle with key words liberally.
Good luck
| [reply] |
++ IIRC from a math class back in the Cambrian, binary search is about as good as it gets for comparison-based lookups on sorted data. Given a 500,000-word dictionary you will never have to do more than, let's see...
    my $lookups = 0;
    my $words   = 500_000;
    # Halve the number of candidate words until only one is left.
    while ( $words > 1 )
    {
        $words /= 2;
        $lookups++;
    }
    print $lookups, "\n";
...19 lookups -- meaning comparisons with some combination of lt, gt, ge, le, eq.
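The same count can be had without the floating-point halving. A minimal sketch, assuming all we want is the smallest k with 2**k >= n (the sub name `halvings_needed` is made up for this illustration):

```perl
use strict;
use warnings;

# Smallest $k such that 2**$k >= $n, i.e. the number of halvings
# needed to narrow $n candidates down to a single one.
sub halvings_needed {
    my ($n) = @_;
    my $k = 0;
    $k++ while 2**$k < $n;
    return $k;
}

# 500_000 gives 19, because 2**18 = 262_144 and 2**19 = 524_288.
printf "%7d words: %2d halvings\n", $_, halvings_needed($_)
    for 1_000, 500_000, 1_000_000;
```

Depending on how an implementation tests for equality, the worst case can cost one comparison more than this figure, so treat it as an estimate rather than an exact bound.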
| [reply] [d/l] [select] |
Thank you all for your help. You aren't Perl monks, but Perl gods!
Just a question, what does the
    $words /= 2;
part mean? Does it divide $words by 2, to get 250,000?
| [reply] [d/l] |
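For reference, `/=` is Perl's division-assignment operator: it divides the variable on the left by the value on the right and stores the result back in the variable. A small sketch (the variable `$odd` is only there for illustration):

```perl
use strict;
use warnings;

my $words = 500_000;
$words /= 2;             # same as: $words = $words / 2;
print "$words\n";        # 250000

# Perl's division is not integer division, so odd values leave a fraction.
my $odd = 5;
$odd /= 2;
print "$odd\n";          # 2.5
```

That fractional behaviour is why the loop above tests `$words > 1` rather than waiting for an exact value.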
Thank you for your help, bobf. I looked up binary search with Super Search, but didn't find anything easy enough for me to understand.
How could I use binary search to determine if a word in my corpus is there?
| [reply] |
Binary search, and algorithms in general, are something you must try to get a grasp on no matter what programming language you work with. Often in a high-level programming language you don't need to implement them yourself, but understanding them will have a profound impact on how well you can learn and exploit a particular language. The more algorithms and data structures you are familiar with, the easier it will be to see good, simple solutions to tasks that seem hard without such knowledge.
Binary search is the process of halving the remaining search space with each test. So, let's say you have a dictionary of 100,000 words in a SORTED array, and you want to test whether the word Trzagrat is in there somewhere. First you look at the position in the middle of the array. Either you find the word there or, if not, you know which half of the dictionary must hold the word if it's there at all: because the dictionary is sorted, the word you're looking up sorts either before or after the word you found at this position. So you continue in that half, and repeat until you find the word or run out of positions to check.
There is a better explanation on Wikipedia: binary search.
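To make the halving concrete, here is a rough sketch that records how the range of candidate positions shrinks during a search. The word list and the sub name `trace_search` are invented for this example; any sorted array would do.

```perl
use strict;
use warnings;

# Hypothetical sample data, sorted so that halving is valid.
my @dictionary = sort qw(apple banana cherry date fig grape kiwi lemon mango);

# Return the "low..high" range examined at each step while looking for $target.
sub trace_search {
    my ($target, $words) = @_;
    my ($low, $high) = (0, $#$words);
    my @ranges;
    while ($low <= $high) {
        push @ranges, "$low..$high";
        my $mid = int(($low + $high) / 2);
        my $cmp = $target cmp $words->[$mid];
        last if $cmp == 0;                 # found it at the midpoint
        if ($cmp < 0) { $high = $mid - 1 } # keep the earlier half
        else          { $low  = $mid + 1 } # keep the later half
    }
    return @ranges;
}

print join(" -> ", trace_search("kiwi", \@dictionary)), "\n";
```

With this nine-word list the search for "kiwi" prints `0..8 -> 5..8`: one comparison discards more than half of the array.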
| [reply] |
Ok, sorry for the question, I just found a nice thread.
| [reply] |
Re: Check word presence WITHOUT hashes or grep
by ysth (Canon) on Apr 30, 2008 at 08:17 UTC
Since you are supposed to be doing this for your learning, I don't think anyone will just hand you the code you want, but here's a description of a binary search:
1. Keep a high and a low index into the array that demarcate the possible range that may contain your sought word. Initially, the low index is 0 and the high index is the last index of the array, or -1 if the array is empty.
2. See if the high index is at least as great as the low index (i.e. the possible range of the array that might contain the sought word is not empty). If not, the sought word is not in the array and you are done.
3. Find the midpoint between the high and low indexes, rounding either up or down.
4. If the word at the midpoint index is greater than (using a string comparison) the sought word, the sought word must occur earlier in the array, if at all. Set the high index to one less than the midpoint index, and continue at step 2.
5. Otherwise, if the word at the midpoint index is less than (using a string comparison) the sought word, the sought word must occur later in the array, if at all. Set the low index to one more than the midpoint index, and continue at step 2.
6. Otherwise, the sought word is at the midpoint index: you've found it and are done.
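For anyone who wants to check their own attempt afterwards, the steps above can be sketched roughly as follows. This is one possible reading of the description, not ysth's code; the sub name `binary_search` and the sample words are assumptions made for the example, and it uses neither hashes nor grep.

```perl
use strict;
use warnings;

# Return the index of $target in the sorted array @$words, or -1 if absent.
sub binary_search {
    my ($target, $words) = @_;
    my ($low, $high) = (0, $#$words);      # step 1: $#$words is -1 when empty
    while ($low <= $high) {                # step 2: stop when the range is empty
        my $mid = int(($low + $high) / 2); # step 3: midpoint, rounded down
        if ($words->[$mid] gt $target) {
            $high = $mid - 1;              # step 4: target sorts earlier
        }
        elsif ($words->[$mid] lt $target) {
            $low = $mid + 1;               # step 5: target sorts later
        }
        else {
            return $mid;                   # step 6: found
        }
    }
    return -1;
}

# Hypothetical word list; it must be sorted with the same string ordering.
my @dictionary = sort qw(pear apple fig cherry banana);
for my $word (qw(fig grape)) {
    my $i   = binary_search($word, \@dictionary);
    my $msg = $i >= 0 ? "found at index $i" : "not found";
    print "$word: $msg\n";
}
```

The array has to be sorted with the same comparison the search uses (here the default string order of `sort`, matching `lt` and `gt`), otherwise the halving can discard the half that holds the word.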
A common variation is to have the "high" index be one more than the highest possible index.
Test well; it's very common for people writing binary search code to have off-by-one errors that result in false negatives or endless loops.
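One way to act on that advice is to compare the binary search against a plain linear scan for every small array size. The sketch below does so for the "high is one past the end" variation; the names `contains_word`, `linear_contains` and `@pool` are made up for the illustration.

```perl
use strict;
use warnings;

# Variation: $high is one more than the highest candidate index,
# so the range is empty exactly when $low == $high.
sub contains_word {
    my ($target, $words) = @_;
    my ($low, $high) = (0, scalar @$words);
    while ($low < $high) {
        my $mid = int(($low + $high) / 2);
        my $cmp = $words->[$mid] cmp $target;
        return 1 if $cmp == 0;
        if ($cmp > 0) { $high = $mid }     # target sorts earlier; $mid excluded
        else          { $low = $mid + 1 }  # target sorts later
    }
    return 0;
}

# Plain linear scan, used only as the reference answer.
sub linear_contains {
    my ($target, $words) = @_;
    for my $w (@$words) { return 1 if $w eq $target }
    return 0;
}

# Try every list length from 0 to 8 with present and absent probes.
my @pool = sort qw(b d f h j l n p);
my $mismatches = 0;
for my $n (0 .. scalar @pool) {
    my @words = @pool[0 .. $n - 1];
    for my $probe (qw(a b c d e f g h i j k l m n o p q)) {
        $mismatches++
            if contains_word($probe, \@words) != linear_contains($probe, \@words);
    }
}
print "mismatches: $mismatches\n";
```

Covering the empty array, single-element arrays, and probes that fall before, between and after the stored words is what tends to expose off-by-one mistakes.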
| [reply] |
Ok ysth, thank you for your help. Now I think I get how it works, I'll give it a try and tell you guys how it went.
Thank you all for your patience.
| [reply] |
Let us know if there's some particular part you get stuck on.
| [reply] |
Re: Check word presence WITHOUT hashes or grep
by pc88mxer (Vicar) on Apr 30, 2008 at 05:07 UTC
Well, you could always do what we had to do before we had grep and hashes:
    my $found = 0;
    for my $w (@word_list) {
        if ($w eq $corpus_word) {
            $found = 1;
            last;    # no need to look at the remaining words
        }
    }
    # $found == 1 if $corpus_word was found in @word_list
Of course, there might be more efficient ways to do this.
| [reply] [d/l] [select] |