dice's coefficient

locked_user zanruka has asked for the wisdom of the Perl Monks concerning the following question:

This is essentially what I have at the moment. What is supposed to happen is that the user types in a word and the word is split into letter pairs,
e.g. clock -> cl lo oc ck. Then a dictionary file is read in and the same procedure is run for each word in the file. The letter pairs of the word typed in by the user are then compared to the letter pairs of the words in the dictionary for any matches.
Anything at all to help me progress with this would be greatly appreciated, thanks.

use porter; # load the porter stemmer module

print "Please type in the word you would like to spell check.\n";
$word = <STDIN>;
$stem = porter($word);
print "$word has been shortened to $stem for the purpose of spell checking!\n";

open (DICT, "dictionary.txt") || die "cant open dictionary.txt\n"; # Open the dictionary file for reading
while (<DICT>) {
    $dict = $_; #Take a line of the file and put it into a variable
    chomp($dict); #strip out any control characters ie \n \cf etc
    $dictionary{$dict} = $dict; #Create a hash of the words in the dictionary
    #foreach $dict
}

$offset = 0;
while ($stem gt $offset) {
    $ngram = substr($stem,$offset,2);
    foreach $ngram (@ngram) {
    print "@ngram\n";
    }
    $offset++;
}

close (DICT);

Update: I'd like to thank everyone who helped me out while I was in a bit of a rut. With the contributions of the helpful people here and a few very late nights, it finally does what it's supposed to do. Thanks again!!

Replies are listed 'Best First'.
Re: dice's coefficient by moritz (Cardinal) on Apr 13, 2008 at 22:45 UTC
A few thoughts:
1) Always `use strict; use warnings;` and declare your variables.
2) Think about the type of your variables. `while ($stem gt $offset)` seems pretty useless to me. I'd suspect `$stem` to be a string, and `$offset` certainly is a number. Maybe you need `while ($offset < length($stem) - 1){ ... }`?
3) `$dictionary{$dict} = $dict;` are you sure you need exactly that? I'd suspect that you need to look up those digrams in the hash, but that won't work if you take the whole word as hash key.
I think you need to be more specific about what you are trying to achieve, and how you want to do it (I'm not familiar with Dice's coefficient, and I'm sure I'm not the only one).
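A minimal sketch of the loop from point 2, assuming the goal is to collect every adjacent letter pair of the word. The sub name `bigrams` is made up for illustration and does not appear in the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the overlapping letter pairs of a word.
sub bigrams {
    my ($word) = @_;
    my @pairs;
    my $offset = 0;
    # Numeric comparison against the length, as suggested above;
    # stopping one short of the end keeps every substr two letters long.
    while ($offset < length($word) - 1) {
        push @pairs, substr($word, $offset, 2);
        $offset++;
    }
    return @pairs;
}

print join(' ', bigrams('clock')), "\n";   # prints "cl lo oc ck"
```

A one-letter or empty word yields an empty list, so callers may want to skip such words before computing any coefficient.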
Re: dice's coefficient by GrandFather (Saint) on Apr 13, 2008 at 23:21 UTC
The following may get you headed in a useful direction:

use strict;
use warnings;
use List::Compare;

my @words = qw(dictate world mamal);
my %dict;

# Build a lookup for the dictionary words
while (defined (my $word = <DATA>)) {
    chomp $word;
    next unless length $word;
    my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. length ($word) - 1;
    next unless @bigrams;
    $dict{$word} = \@bigrams;
}

# Process the given words
for my $word (@words) {
    my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. length ($word) - 1;
    next unless @bigrams;

    for my $dictWord (keys %dict) {
        my $lc = List::Compare->new($dict{$dictWord}, \@bigrams);
        my @common = $lc->get_intersection ();
        my $diceCoef = 2 * @common / (@bigrams + @{$dict{$dictWord}});

        next unless $diceCoef;
        print "Dice coefficient for '$word' and '$dictWord' is $diceCoef\n";
    }
}

__DATA__
a
small
dictionary
of
words

Prints:

Dice coefficient for 'dictate' and 'dictionary' is 0.4
Dice coefficient for 'world' and 'words' is 0.5
Dice coefficient for 'mamal' and 'small' is 0.5

Perl is environmentally friendly - it saves trees
Re^2: dice's coefficient by Anonymous Monk on Apr 14, 2008 at 07:51 UTC
There is a neat (and usually quite fast) regex hack for extracting overlapping patterns:

perl -wMstrict -e "for my $word (@ARGV) { my @bigrams = $word =~ m{ (?= (..) ) }xmsg; print qq(bigrams of $word: @bigrams \n) }" foo wibble a be
bigrams of foo: fo oo
bigrams of wibble: wi ib bb bl le
bigrams of a:
bigrams of be: be

(I think GrandFather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)
Re^3: dice's coefficient by GrandFather (Saint) on Apr 14, 2008 at 09:57 UTC
GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(.

Perl is environmentally friendly - it saves trees
Re^3: dice's coefficient by Anonymous Monk on Jan 14, 2012 at 10:49 UTC
It's neat, but it's slower than using substr:

use Benchmark qw(cmpthese);

my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';

cmpthese -1, {
    regex  => sub { () = $str =~ /(?=(..))/g },
    substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
};

          Rate  regex substr
regex  13917/s     --   -30%
substr 19910/s    43%     --
Re^4: dice's coefficient by AnomalousMonk (Archbishop) on Jan 14, 2012 at 12:00 UTC
Re^5: dice's coefficient by Anonymous Monk on Jan 15, 2012 at 02:27 UTC
Re: dice's coefficient by ysth (Canon) on Apr 14, 2008 at 04:21 UTC
What's the coefficient of "aardvark" and "dark"? "aardvark" and "arbitrary"? Even http://en.wikipedia.org/wiki/Dice%27s_coefficient doesn't clarify this.

Not sure what you want to do with the coefficients, so I made stuff up:

use strict;
use warnings;

$| = 1;
print "Enter word: ";
chomp(my $word = <STDIN>);
my @pairs = $word =~ /(?=(..))/g;
my $matcher = qr/(?=(@{[join "|", @pairs]}))/;
my %coef;

open my $dict, "<", "/usr/share/dict/words"
    or die "Couldn't open dictionary: $!";
while (my $dictword = <$dict>) {
    chomp($dictword);
    # skip proper nouns and anything with a non-letter
    next if $dictword =~ /[^a-z]/;
    my $matches = () = $dictword =~ /$matcher/g;
    my $coef = 2 * $matches / (@pairs + length($dictword)-1);
    push @{$coef{$coef}}, $dictword;
}

print "Top coefficients for $word:\n";
for my $coef ((sort { $b <=> $a } keys %coef)[0..4]) {
    next if ! $coef;
    print "$coef: ", join " ", @{$coef{$coef}}, "\n";
}

-- Online Fortune Cookie Search
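One possible reading of the "aardvark" and "dark" question, sketched under the assumption that repeated bigrams are collapsed into a set before comparing, so that "ar" in "aardvark" counts once. The sub names `bigram_set` and `dice_set` are hypothetical, and this is only one of several defensible interpretations.

```perl
use strict;
use warnings;

# Hypothetical helper: the distinct bigrams of a word, as hash keys.
sub bigram_set {
    my ($word) = @_;
    my %set;
    $set{$_} = 1 for $word =~ /(?=(..))/g;
    return \%set;
}

# Set-based Dice coefficient: twice the number of shared distinct
# bigrams, divided by the total number of distinct bigrams in both words.
sub dice_set {
    my ($x, $y) = map { bigram_set($_) } @_;
    my $total = keys(%$x) + keys(%$y);
    return 0 unless $total;            # neither word has a bigram
    my $common = grep { exists $y->{$_} } keys %$x;
    return 2 * $common / $total;
}

printf "%.3f\n", dice_set('aardvark', 'dark');   # prints 0.444
```

Under this reading the result is symmetric in its two arguments, which is not true of a count that matches one word's pairs against every position of the other.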
Re^2: dice's coefficient by hiddenOx (Novice) on Apr 19, 2008 at 06:20 UTC
ysth, that's really great. I had been searching for help like this for a while, but there is an issue: how can I make it match each pair only once?

For example, assume the word is "gogo", so the n-grams will be go-og-go, and the word to check is "golo", so the n-grams will be go-ol-lo. The number of matches in the current code will be 2, as it counts "go" twice, although it should be counted only once, so the score should be 1.

Please help. Awaiting your kind reply. Thank you very much.
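A sketch of one way to get the count asked for here, assuming each occurrence of a pair may be matched at most once: tally the pairs of both words and take the smaller tally for every pair they share. The sub names `bigram_counts` and `shared_bigrams` are hypothetical and not part of the code posted above.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each bigram occurs in a word.
sub bigram_counts {
    my ($word) = @_;
    my %count;
    $count{$_}++ for $word =~ /(?=(..))/g;
    return \%count;
}

# Count shared bigrams, using each occurrence at most once:
# a bigram seen m times in one word and n times in the other
# contributes min(m, n) matches.
sub shared_bigrams {
    my ($x, $y) = map { bigram_counts($_) } @_;
    my $shared = 0;
    for my $pair (keys %$x) {
        next unless exists $y->{$pair};
        $shared += $x->{$pair} < $y->{$pair} ? $x->{$pair} : $y->{$pair};
    }
    return $shared;
}

my ($w1, $w2) = ('gogo', 'golo');
my $shared = shared_bigrams($w1, $w2);
# A word of length n has n - 1 bigrams.
my $dice = 2 * $shared / ((length($w1) - 1) + (length($w2) - 1));
printf "%d shared, coefficient %.3f\n", $shared, $dice;   # prints "1 shared, coefficient 0.333"
```

The division assumes at least one of the two words has two or more letters; a real script would want to guard against a zero denominator.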
Re: thanks very much by ww (Archbishop) on Apr 17, 2008 at 02:53 UTC
If, as it appears above, you removed your original question and changed the title, please DO NOT do so again. (And if I'm wrong, whack me with a clue-by-four.)

As one of the gods recently remarked to another who replaced the original content of a post, doing so is "rude." I concur: doing so greatly reduces (or completely obviates) the chance that some future Monk will benefit from the work people put into helping you solve your problem.