Re^2: dice's coefficient

There is a neat (and usually quite fast) regex hack for extracting overlapping patterns:

perl -wMstrict -e
"for my $word (@ARGV) {
  my @bigrams = $word =~ m{ (?= (..) ) }xmsg;
  print qq(bigrams of $word: @bigrams \n)
  }
" foo wibble a be
bigrams of foo: fo oo
bigrams of wibble: wi ib bb bl le
bigrams of a:
bigrams of be: be

(I think GrandFather is well aware of this hack and did not suggest it because he suspects it is a bit above zanruka's current coefficient of proficiency.)
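For context, here is a minimal sketch of how the bigram lists might feed into Dice's coefficient, assuming the usual formulation of twice the number of shared bigrams divided by the total number of bigrams in both words. The names `bigrams` and `dice` are hypothetical, and counting repeated bigrams as separate occurrences is an assumption rather than something settled in this thread.

```perl
use strict;
use warnings;

# Overlapping two-character substrings, via the lookahead capture shown above.
sub bigrams {
    my ($word) = @_;
    return $word =~ m{ (?= (..) ) }xmsg;
}

# Dice's coefficient over bigram lists: 2 * shared / (count_a + count_b).
# Assumption: repeated bigrams count as separate occurrences; a set-based
# variant would de-duplicate both lists first.
sub dice {
    my ($left, $right) = @_;
    my @a = bigrams($left);
    my @b = bigrams($right);
    return 0 unless @a + @b;    # both words too short: avoid dividing by zero

    my %seen;
    $seen{$_}++ for @a;

    my $shared = 0;
    for my $bg (@b) {
        if ($seen{$bg}) {       # consume one occurrence per match
            $seen{$bg}--;
            $shared++;
        }
    }
    return 2 * $shared / (@a + @b);
}

printf "%.2f\n", dice('night', 'nacht');    # only "ht" is shared: 2*1/8 = 0.25
```

Words shorter than two characters yield no bigrams, so the sketch returns 0 for them instead of dividing by zero.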

Re^3: dice's coefficient by GrandFather (Saint) on Apr 14, 2008 at 09:57 UTC
GrandFather is well aware of it and forgets about it pretty much every time something like this comes up :(.
Perl is environmentally friendly - it saves trees
Re^3: dice's coefficient by Anonymous Monk on Jan 14, 2012 at 10:49 UTC
It's neat, but it's slower than using substr:

use Benchmark qw(cmpthese);
my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';
cmpthese -1, {
    regex  => sub { () = $str =~ /(?=(..))/g },
    substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
};

          Rate  regex substr
regex  13917/s     --   -30%
substr 19910/s    43%     --
Re^4: dice's coefficient by AnomalousMonk (Archbishop) on Jan 14, 2012 at 12:00 UTC
unpack is even faster, even with the need to calculate the `$n` repeat count. (There's probably a way to get rid of this calculation, but I can't see it at the moment.)

>perl -wMstrict -le
"use Benchmark qw(cmpthese);
 use Test::More 'no_plan';
 ;;
 my $str = 'wwibblewibblewibblewibbleibblewibblewibblewibble';
 ;;
 cmpthese -1, {
   regex  => sub { () = $str =~ /(?=(..))/g },
   substr => sub { () = map { substr $str, $_, 2 } (0 .. length($str) - 2) },
   unpack => sub {
     my $n = length($str) ? length($str) - 1 : 0;
     () = unpack qq{(a2 X)$n}, $str;
   },
 };
 ;;
 sub bigrams {
   my $n = length($_[0]) ? length($_[0]) - 1 : 0;
   return unpack qq{(a2 X)$n}, $_[0];
 }
 ;;
 is_deeply [ bigrams('')      ], [];
 is_deeply [ bigrams('a')     ], [];
 is_deeply [ bigrams('ab')    ], [ qw(ab) ];
 is_deeply [ bigrams('abc')   ], [ qw(ab bc) ];
 is_deeply [ bigrams('abcd')  ], [ qw(ab bc cd) ];
 is_deeply [ bigrams('abcde') ], [ qw(ab bc cd de) ];
"
          Rate  regex substr unpack
regex  11934/s     --   -34%   -66%
substr 18066/s    51%     --   -48%
unpack 34816/s   192%    93%     --
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
1..6
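For readers unfamiliar with the template, here is a small standalone sketch of what `(a2 X)$n` does. The function name `bigrams_unpack` is hypothetical; the logic is the same as the `bigrams` sub above, just written out with comments.

```perl
use strict;
use warnings;

# Template "(a2 X)$n":
#   a2   reads two characters from the current position
#   X    steps back one position
#   $n   repeats the group, so each repetition starts one character
#        later than the previous one
sub bigrams_unpack {
    my ($str) = @_;
    # A string of length L has L - 1 bigrams; the empty string has none.
    my $n = length($str) ? length($str) - 1 : 0;
    return unpack "(a2 X)$n", $str;
}

print join(' ', bigrams_unpack('wibble')), "\n";    # wi ib bb bl le
```

The explicit repeat count is what stops the group after the last full bigram, which is why `$n` has to be computed from the string length.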
Re^5: dice's coefficient by Anonymous Monk on Jan 15, 2012 at 02:27 UTC
It works well in that benchmark, but falls behind substr once the bigrams are actually counted into a hash:

use Benchmark qw(cmpthese);
my $str = "wwibblewibblewibblewibbleibblewibblewibblewibble";
cmpthese -1, {
    regex => sub {
        my %count;
        ++$count{$_} for $str =~ /(?=(..))/g;
    },
    substr => sub {
        my %count;
        ++$count{substr $str, $_, 2} for (0 .. length($str) - 2);
    },
    unpack => sub {
        my %count;
        my $n = length($str) - 1;
        ++$count{$_} for unpack qq{(a2 X)$n}, $str;
    },
};

          Rate  regex unpack substr
regex  15316/s     --   -43%   -74%
unpack 26935/s    76%     --   -54%
substr 58514/s   282%   117%     --

substr slows down by about 50% if the string contains UTF-8 characters, but it's still significantly faster than unpack in this benchmark:

my $str = "wwibblewibblewibblewibbleibblewibblewibblewibble\x{20ac}";

          Rate  regex unpack substr
regex  14222/s     --   -35%   -51%
unpack 21976/s    55%     --   -24%
substr 29020/s   104%    32%     --