in reply to Re: dice's coefficient
in thread dice's coefficient

There is a neat (and usually quite fast) regex hack for extracting overlapping patterns:

    perl -wMstrict -e
    "for my $word (@ARGV) {
       my @bigrams = $word =~ m{ (?= (..) ) }xmsg;
       print qq(bigrams of $word: @bigrams \n)
     }
    " foo wibble a be
    bigrams of foo: fo oo
    bigrams of wibble: wi ib bb bl le
    bigrams of a:
    bigrams of be: be
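
For anyone who would rather have this in a script than a one-liner, here is a minimal sketch of the same idea. The lookahead is zero-width, so each match captures two characters without consuming them, and the next match attempt starts one position further along; that is what makes the bigrams overlap. The sub name `bigrams_regex` is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative): return the overlapping
# two-character substrings of a word, using the lookahead hack above.
sub bigrams_regex {
    my ($word) = @_;
    # In list context, a /g match returns every capture; no match gives ().
    my @bigrams = $word =~ m{ (?= (..) ) }xmsg;
    return @bigrams;
}

for my $word (qw(foo wibble a be)) {
    my @bigrams = bigrams_regex($word);
    print "bigrams of $word: @bigrams\n";
}
```

Words shorter than two characters simply produce an empty list, as in the one-liner's output for `a`.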

(I think Grandfather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)

Re^3: dice's coefficient
by GrandFather (Saint) on Apr 14, 2008 at 09:57 UTC

    GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(.


    Perl is environmentally friendly - it saves trees
Re^3: dice's coefficient
by Anonymous Monk on Jan 14, 2012 at 10:49 UTC
    It's neat, but it's slower than using substr:

        use Benchmark qw(cmpthese);

        my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';

        cmpthese -1, {
            regex  => sub { () = $str =~ /(?=(..))/g },
            substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
        };

                  Rate  regex substr
        regex  13917/s     --   -30%
        substr 19910/s    43%     --

      unpack is faster still, even with the need to calculate the $n repeat count. (There's probably a way to get rid of this calculation, but I can't see it at the moment.)

          >perl -wMstrict -le
          "use Benchmark qw(cmpthese);
           use Test::More 'no_plan';
           ;;
           my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';
           ;;
           cmpthese -1, {
             regex  => sub { () = $str =~ /(?=(..))/g },
             substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
             unpack => sub {
               my $n = length($str) ? length($str) - 1 : 0;
               () = unpack qq{(a2 X)$n}, $str;
             },
           };
           ;;
           sub bigrams {
             my $n = length($_[0]) ? length($_[0]) - 1 : 0;
             return unpack qq{(a2 X)$n}, $_[0];
           }
           ;;
           is_deeply [ bigrams('')      ], [];
           is_deeply [ bigrams('a')     ], [];
           is_deeply [ bigrams('ab')    ], [ qw(ab) ];
           is_deeply [ bigrams('abc')   ], [ qw(ab bc) ];
           is_deeply [ bigrams('abcd')  ], [ qw(ab bc cd) ];
           is_deeply [ bigrams('abcde') ], [ qw(ab bc cd de) ];
          "
                    Rate  regex substr unpack
          regex  11934/s     --   -34%   -66%
          substr 18066/s    51%     --   -48%
          unpack 34816/s   192%    93%     --
          ok 1
          ok 2
          ok 3
          ok 4
          ok 5
          ok 6
          1..6
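
      To spell out what that template does: `a2` reads two characters and `X` backs up one position, so each repetition of the group starts one character further along than the last. A minimal sketch that prints the template built for a few inputs, assuming the same repeat-count calculation as bigrams() above; `bigram_template` is a hypothetical name used only for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: build the unpack template used by bigrams() above.
      # The repeat count is length - 1, or 0 for an empty string.
      sub bigram_template {
          my ($str) = @_;
          my $n = length($str) ? length($str) - 1 : 0;
          return "(a2 X)$n";
      }

      for my $str ('', 'a', 'ab', 'abcd') {
          my $template = bigram_template($str);
          my @bigrams  = unpack $template, $str;
          printf "%-7s %-9s %s\n", "'$str'", $template, "@bigrams";
      }
      ```

      A repeat count of 0 makes the group a no-op, which is why the empty string and single-character cases return nothing.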
        It works well in that benchmark, but falls down here:

            use Benchmark qw(cmpthese);

            my $str = "wwibblewibblewibblewibbleibblewibblewibblewibble";

            cmpthese -1, {
                regex => sub {
                    my %count;
                    ++$count{$_} for $str =~ /(?=(..))/g;
                },
                substr => sub {
                    my %count;
                    ++$count{substr $str, $_, 2} for (0 .. length($str) - 2);
                },
                unpack => sub {
                    my %count;
                    my $n = length($str) - 1;
                    ++$count{$_} for unpack qq{(a2 X)$n}, $str;
                },
            };

                      Rate  regex unpack substr
            regex  15316/s     --   -43%   -74%
            unpack 26935/s    76%     --   -54%
            substr 58514/s   282%   117%     --

        substr slows down by about 50% if the string contains UTF-8 characters, but it's still significantly faster than unpack in this benchmark:

            my $str = "wwibblewibblewibblewibbleibblewibblewibblewibble\x{20ac}";

                      Rate  regex unpack substr
            regex  14222/s     --   -35%   -51%
            unpack 21976/s    55%     --   -24%
            substr 29020/s   104%    32%     --
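
        For reference, here is the counting loop from the substr case above pulled out into a sub, as a minimal sketch. The name `bigram_counts` is invented for illustration; the loop body is the same as in the benchmark.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper: count how often each bigram occurs, using the
        # substr approach from the benchmark above.
        sub bigram_counts {
            my ($str) = @_;
            my %count;
            # For strings shorter than two characters the range is empty,
            # so the loop body never runs and the hash stays empty.
            ++$count{ substr $str, $_, 2 } for 0 .. length($str) - 2;
            return \%count;
        }

        my $counts = bigram_counts('wibblewibble');
        print "$_: $counts->{$_}\n" for sort keys %$counts;
        ```

        Unlike the unpack version in that benchmark, this needs no guard for the empty string, since `0 .. -2` is just an empty range.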