locked_user zanruka has asked for the wisdom of the Perl Monks concerning the following question:

This is essentially what I have at the moment. What is supposed to happen is that the user types in a word and the word is split into letter pairs, e.g. clock -> cl lo oc ck. Then a dictionary file is read in and the same procedure is run for each word in the file. Finally, the letter pairs of the word typed in by the user are compared to the letter pairs of the words in the dictionary for any matches.
Anything at all to help me progress with this would be greatly appreciated, thanks.
use porter; # load the porter stemmer module

print "Please type in the word you would like to spell check.\n";
$word = <STDIN>;
$stem = porter($word);
print "$word has been shortened to $stem for the purpose of spell checking!\n";

open (DICT, "dictionary.txt") || die "cant open dictionary.txt\n"; # Open the dictionary file for reading
while (<DICT>) {
    $dict = $_;                  #Take a line of the file and put it into a variable
    chomp($dict);                #strip out any control characters ie \n \cf etc
    $dictionary{$dict} = $dict;  #Create a hash of the words in the dictionary
    #foreach $dict
}

$offset = 0;
while ($stem gt $offset) {
    $ngram = substr($stem,$offset,2);
    foreach $ngram (@ngram) {
        print "@ngram\n";
    }
    $offset++;
}
close (DICT);

Update: I'd like to thank everyone who helped me out while I was in a bit of a rut. With the contributions of the helpful people here and a few very late nights, it finally does what it's supposed to do. Thanks again!!

Re: dice's coefficient
by moritz (Cardinal) on Apr 13, 2008 at 22:45 UTC
    A few thoughts:

    1) Always use strict; use warnings; and declare your variables.

    2) Think about the type of your variables. while ($stem gt $offset) seems pretty useless to me: gt is a string comparison, I'd suspect $stem to be a string, and $offset certainly is a number. Maybe you need while ($offset < length($stem) - 1){ ... }?

    3) $dictionary{$dict} = $dict; are you sure you need exactly that? I'd suspect that you need to look up those digrams in the hash, but that won't work if you take the whole word as the hash key.

    I think you need to be more specific about what you are trying to achieve, and how you want to do it (I'm not familiar with Dice's coefficient, and I'm sure I'm not the only one).
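
A minimal sketch of point 2, assuming the goal is the overlapping letter pairs described in the question (clock -> cl lo oc ck). The helper name bigrams is hypothetical and does not appear in the original post; the loop condition is the one suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the
# overlapping letter pairs of a word, e.g. clock -> cl lo oc ck.
sub bigrams {
    my ($word) = @_;
    my @pairs;
    my $offset = 0;
    # Numeric comparison against the string length, as suggested in point 2.
    while ($offset < length($word) - 1) {
        push @pairs, substr($word, $offset, 2);
        $offset++;
    }
    return @pairs;
}

print join(' ', bigrams('clock')), "\n";   # prints: cl lo oc ck
```

Words shorter than two letters simply yield an empty list, so the caller can skip them.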

Re: dice's coefficient
by GrandFather (Saint) on Apr 13, 2008 at 23:21 UTC

    The following may get you headed in a useful direction:

    use strict;
    use warnings;
    use List::Compare;

    my @words = qw(dictate world mamal);
    my %dict;

    # Build a lookup for the dictionary words
    while (defined (my $word = <DATA>)) {
        chomp $word;
        next unless length $word;
        my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. length ($word) - 1;
        next unless @bigrams;
        $dict{$word} = \@bigrams;
    }

    # Process the given words
    for my $word (@words) {
        my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. length ($word) - 1;
        next unless @bigrams;

        for my $dictWord (keys %dict) {
            my $lc = List::Compare->new($dict{$dictWord}, \@bigrams);
            my @common = $lc->get_intersection ();
            my $diceCoef = 2 * @common / (@bigrams + @{$dict{$dictWord}});
            next unless $diceCoef;
            print "Dice coefficient for '$word' and '$dictWord' is $diceCoef\n";
        }
    }

    __DATA__
    a
    small
    dictionary
    of
    words

    Prints:

    Dice coefficient for 'dictate' and 'dictionary' is 0.4
    Dice coefficient for 'world' and 'words' is 0.5
    Dice coefficient for 'mamal' and 'small' is 0.5

    Perl is environmentally friendly - it saves trees
      There is a neat (and usually quite fast) regex hack for extracting overlapping patterns:

      perl -wMstrict -e
      "for my $word (@ARGV) {
          my @bigrams = $word =~ m{ (?= (..) ) }xmsg;
          print qq(bigrams of $word: @bigrams \n)
      }"
      foo wibble a be

      bigrams of foo: fo oo
      bigrams of wibble: wi ib bb bl le
      bigrams of a:
      bigrams of be: be

      (I think GrandFather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)

        GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(.


        It's neat, but it's slower than using substr:
        use Benchmark qw(cmpthese);

        my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';

        cmpthese -1, {
            regex  => sub { () = $str =~ /(?=(..))/g },
            substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
        };

                  Rate  regex substr
        regex  13917/s     --   -30%
        substr 19910/s    43%     --
Re: dice's coefficient
by ysth (Canon) on Apr 14, 2008 at 04:21 UTC
    What's the coefficient of "aardvark" and "dark"? "aardvark" and "arbitrary"? Even http://en.wikipedia.org/wiki/Dice%27s_coefficient doesn't clarify this.

    Not sure what you want to do with the coefficients, so I made stuff up:

    use strict;
    use warnings;

    $| = 1;
    print "Enter word: ";
    chomp(my $word = <STDIN>);

    my @pairs = $word =~ /(?=(..))/g;
    my $matcher = qr/(?=(@{[join "|", @pairs]}))/;

    my %coef;
    open my $dict, "<", "/usr/share/dict/words"
        or die "Couldn't open dictionary: $!";
    while (my $dictword = <$dict>) {
        chomp($dictword);
        # skip proper nouns and anything with a non-letter
        next if $dictword =~ /[^a-z]/;
        my $matches = () = $dictword =~ /$matcher/g;
        my $coef = 2 * $matches / (@pairs + length($dictword)-1);
        push @{$coef{$coef}}, $dictword;
    }

    print "Top coefficients for $word:\n";
    for my $coef ((sort { $b <=> $a } keys %coef)[0..4]) {
        next if ! $coef;
        print "$coef: ", join " ", @{$coef{$coef}}, "\n";
    }
      ysth,

      That's really great, thank you. I have been searching for help like this for a while.

      There is one issue, though: how do I make it match each pair only once? For example, assume the dictionary word is
      "gogo", so its ngrams are go-og-go,
      and the word to check is "golo", so its ngrams are go-ol-lo.

      The number of matches in the current code will be 2, because it counts "go" twice. It should only be counted once, since the single "go" in "golo" can only be matched once, so the number of matches should be 1.

      Please help.

      Awaiting your kind reply.
      Thank you very much.
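
One possible reading of the request above is that each letter pair in one word may be paired with at most one occurrence in the other word. A minimal sketch under that assumption follows; the helper names bigrams and dice_once are hypothetical and not taken from any reply in this thread, and this is only one way to resolve the duplicate-pair ambiguity.

```perl
use strict;
use warnings;

# Hypothetical helper: overlapping letter pairs of a word.
sub bigrams {
    my ($word) = @_;
    return map { substr $word, $_, 2 } 0 .. length($word) - 2;
}

# Hypothetical helper: Dice coefficient in which every pair occurrence
# in one word can be matched by at most one occurrence in the other.
sub dice_once {
    my ($x, $y) = @_;
    my @px = bigrams($x);
    my @py = bigrams($y);
    return 0 unless @px + @py;     # avoid dividing by zero

    # Count how many times each pair is available in the first word.
    my %available;
    $available{$_}++ for @px;

    my $shared = 0;
    for my $pair (@py) {
        if ($available{$pair}) {
            $available{$pair}--;   # use up one occurrence
            $shared++;
        }
    }
    return 2 * $shared / (@px + @py);
}

printf "%.3f\n", dice_once('gogo', 'golo');   # prints: 0.333
```

For "gogo" and "golo" only one "go" is shared, so the result is 2 * 1 / (3 + 3). Identical words still score 1.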
Re: thanks very much
by ww (Archbishop) on Apr 17, 2008 at 02:53 UTC

    If, as it appears above, you removed your original question and changed the title, please *DO NOT* do so again. (And if I'm wrong, whack me with a clue-by-four.)

    As one of the gods recently remarked to another who replaced the original content of a post, doing so is "rude."

    I concur: Doing so greatly reduces (or completely obviates) the chance that some future Monk will benefit from the work people put into helping you solve your problem.