Re: dice's coefficient

The following may get you headed in a useful direction:

use strict;
use warnings;
use List::Compare;

my @words = qw(dictate world mamal);
my %dict;

# Build a lookup for the dictionary words
while (defined (my $word = <DATA>)) {
    chomp $word;
    next unless length $word;
    my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. len
+gth ($word) - 1;
    
    next unless @bigrams;
    $dict{$word} = \@bigrams;
}

# Process the given words
for my $word (@words) {
    my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. len
+gth ($word) - 1;

    next unless @bigrams;

    for my $dictWord (keys %dict) {
        my $lc = List::Compare->new($dict{$dictWord}, \@bigrams);
        my @common = $lc->get_intersection ();
        my $diceCoef = 2 * @common / (@bigrams + @{$dict{$dictWord}});
        
        next unless $diceCoef;
        print "Dice coefficient for '$word' and '$dictWord' is $diceCo
+ef\n";
    }
}


__DATA__
a
small
dictionary
of
words
[download]

Prints:

Dice coefficient for 'dictate' and 'dictionary' is 0.4
Dice coefficient for 'world' and 'words' is 0.5
Dice coefficient for 'mamal' and 'small' is 0.5
[download]

Perl is environmentally friendly - it saves trees

Comment on Re: dice's coefficient Select or Download Code

Replies are listed 'Best First'.
Re^2: dice's coefficient by Anonymous Monk on Apr 14, 2008 at 07:51 UTC
There is a neat (and usually quite fast) regex hack for extracting overlapping patterns: `perl -wMstrict -e "for my $word (@ARGV) { my @bigrams = $word =~ m{ (?= (..) ) }xmsg; print qq(bigrams of $word: @bigrams \n) } " foo wibble a be bigrams of foo: fo oo bigrams of wibble: wi ib bb bl le bigrams of a: bigrams of be: be` [download] (I think Grandfather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)	[reply] [d/l]
Re^3: dice's coefficient by GrandFather (Saint) on Apr 14, 2008 at 09:57 UTC
GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(. Perl is environmentally friendly - it saves trees	[reply]
Re^3: dice's coefficient by Anonymous Monk on Jan 14, 2012 at 10:49 UTC
It's neat, but it's slower than using split: `use Benchmark qw(cmpthese); my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble'; cmpthese -1, { regex => sub { () = $str =~ /(?=(..))/g }, substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) + - 2) }, }; Rate regex substr regex 13917/s -- -30% substr 19910/s 43% --` [download]	[reply] [d/l]
Re^4: dice's coefficient by AnomalousMonk (Archbishop) on Jan 14, 2012 at 12:00 UTC
unpack is even faster, even with the need to calculate the `$n` repeat count. (There's probably a way to get rid of this calculation, but I can't see it at the moment.) >perl -wMstrict -le "use Benchmark qw(cmpthese); use Test::More 'no_plan'; ;; my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble'; ;; cmpthese -1, { regex => sub { () = $str =~ /(?=(..))/g }, substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) }, unpack => sub { my $n = length($str) ? length($str) - 1 : 0; () = unpack qq{(a2 X)$n}, $str; }, }; ;; sub bigrams { my $n = length($_[0]) ? length($_[0]) - 1 : 0; return unpack qq{(a2 X)$n}, $_[0]; } ;; is_deeply [ bigrams('') ], []; is_deeply [ bigrams('a') ], []; is_deeply [ bigrams('ab') ], [ qw(ab) ]; is_deeply [ bigrams('abc') ], [ qw(ab bc) ]; is_deeply [ bigrams('abcd') ], [ qw(ab bc cd) ]; is_deeply [ bigrams('abcde') ], [ qw(ab bc cd de) ]; " Rate regex substr unpack regex 11934/s -- -34% -66% substr 18066/s 51% -- -48% unpack 34816/s 192% 93% -- ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 1..6 [download]	[reply] [d/l] [select]
Re^5: dice's coefficient by Anonymous Monk on Jan 15, 2012 at 02:27 UTC