in reply to dice's coefficient

The following may get you headed in a useful direction:

use strict; use warnings; use List::Compare; my @words = qw(dictate world mamal); my %dict; # Build a lookup for the dictionary words while (defined (my $word = <DATA>)) { chomp $word; next unless length $word; my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. len +gth ($word) - 1; next unless @bigrams; $dict{$word} = \@bigrams; } # Process the given words for my $word (@words) { my @bigrams = grep length == 2, map {substr $word, $_, 2} 0 .. len +gth ($word) - 1; next unless @bigrams; for my $dictWord (keys %dict) { my $lc = List::Compare->new($dict{$dictWord}, \@bigrams); my @common = $lc->get_intersection (); my $diceCoef = 2 * @common / (@bigrams + @{$dict{$dictWord}}); next unless $diceCoef; print "Dice coefficient for '$word' and '$dictWord' is $diceCo +ef\n"; } } __DATA__ a small dictionary of words

Prints:

Dice coefficient for 'dictate' and 'dictionary' is 0.4 Dice coefficient for 'world' and 'words' is 0.5 Dice coefficient for 'mamal' and 'small' is 0.5

Perl is environmentally friendly - it saves trees

Replies are listed 'Best First'.
Re^2: dice's coefficient
by Anonymous Monk on Apr 14, 2008 at 07:51 UTC
    There is a neat (and usually quite fast) regex hack for extracting overlapping patterns:

    perl -wMstrict -e "for my $word (@ARGV) { my @bigrams = $word =~ m{ (?= (..) ) }xmsg; print qq(bigrams of $word: @bigrams \n) } " foo wibble a be bigrams of foo: fo oo bigrams of wibble: wi ib bb bl le bigrams of a: bigrams of be: be

    (I think Grandfather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)

      GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(.


      Perl is environmentally friendly - it saves trees
      It's neat, but it's slower than using split:
      use Benchmark qw(cmpthese); my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble'; cmpthese -1, { regex => sub { () = $str =~ /(?=(..))/g }, substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) + - 2) }, }; Rate regex substr regex 13917/s -- -30% substr 19910/s 43% --

        unpack is even faster, even with the need to calculate the  $n repeat count. (There's probably a way to get rid of this calculation, but I can't see it at the moment.)

        >perl -wMstrict -le "use Benchmark qw(cmpthese); use Test::More 'no_plan'; ;; my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble'; ;; cmpthese -1, { regex => sub { () = $str =~ /(?=(..))/g }, substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) }, unpack => sub { my $n = length($str) ? length($str) - 1 : 0; () = unpack qq{(a2 X)$n}, $str; }, }; ;; sub bigrams { my $n = length($_[0]) ? length($_[0]) - 1 : 0; return unpack qq{(a2 X)$n}, $_[0]; } ;; is_deeply [ bigrams('') ], []; is_deeply [ bigrams('a') ], []; is_deeply [ bigrams('ab') ], [ qw(ab) ]; is_deeply [ bigrams('abc') ], [ qw(ab bc) ]; is_deeply [ bigrams('abcd') ], [ qw(ab bc cd) ]; is_deeply [ bigrams('abcde') ], [ qw(ab bc cd de) ]; " Rate regex substr unpack regex 11934/s -- -34% -66% substr 18066/s 51% -- -48% unpack 34816/s 192% 93% -- ok 1 ok 2 ok 3 ok 4 ok 5 ok 6 1..6