comment on

I've made just a few changes (mostly to my sub, so that it comes closer to my original design - your modifications added a few unnecessary steps) and run the benchmarks again.

I personally believe that those steps were not all that unnecessary. To be definite, I chose to compare subs that accept a string and return a hashref of the counts. Yours doesn't, so some extra step is required. According to the last update to Re^4: how to count the number of repeats in a string (really!), I'm posting a new benchmark here, with your sub as a thin layer around the recursive sub. I hope that is fine...

The subs are the last updates of their respective authors, suitably modified just to be consistent with each other and at the explicit request from some of the people who took part to this thread, for generality I changed them to accept as arguments in order: the minimum string length, the minimum repetition count, and the string to be processed.

Note: I wanted to post the benchmark. But

I had problems with oha's sub even in the simplest case;
~~to be sure, before proceeding I added some tests: sadly enough they seem to fail.~~

So, instead of the benchmark, for the moment I'm posting the script with the tests:

#!/usr/bin/perl use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str='aabcdabcabcecdecd'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = substr $str, $off, $len; $count{ $s } ||= ()= $str =~ /$s/g; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; local *count = sub { my( $string) = @_; my $length = length( $string ); if ($length < $min_len) { for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } for my $l ($min_len..$length) { my $s = substr( $string, 0, $l ); $count{ $s } += 1; } count( substr( $string, 1 ) ); }; count($str); \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) # or (.+) (?(?{ !$counts{$1} })(?=.*\1)) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } sub lodin { my ($str, $min_len, $min_rep)=@_; local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_rep - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_rep-1)++ }) (?!) /x; \%count; } { my %cache; sub _reference { $cache{$_[0]} ||= blazar @_ } } for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; is_deeply kramba($str,$len,$rep), _reference($str,$len,$rep), +"kramba $tag"; is_deeply ikegami($str,$len,$rep), _reference($str,$len,$rep), + "ikegami $tag"; is_deeply lodin($str,$len,$rep), _reference($str,$len,$rep), " +lodin $tag"; } } __END__
[download]

15 tests out 18 fail.

Update: I'm an idiot! All tests run smoothly once I use the correct reference. The lesson in this is: don't post when you're too tired. I'm going to sleep and I will update the node with the actual benchmark tomorrow morning... Sorry for the noise!

Update: Ok, I woke up and I'm not that tired anymore. Here's the complete script:

#!/usr/bin/perl

use strict;
use warnings;
use Test::More 'no_plan';
use Benchmark qw/:all :hireswallclock/;

my $str='aabcdabcabcecdecd';

sub blazar {
    my ($str, $min_len, $min_rep)=@_;
    my $l=length($str);
    my %count;

    for my $off (0..$l-1) {
        for my $len ($min_len .. $l-$off) {
            my $s = substr $str, $off, $len;
            $count{ $s } ||= ()= $str =~ /$s/g;
        }
        $count{$_} < $min_rep and
          delete $count{$_} for keys %count;
    }
    \%count;
}

sub kramba {
    my ($str, $min_len, $min_rep)=@_;
    my %count;
    no warnings 'recursion';
    local *count = sub {
        my( $string) = @_;
        my $length = length( $string );

        if ($length < $min_len) {
            for (keys %count) {
                delete $count{$_}
                    if $count{$_} < $min_rep;
            }
            return \%count;
        }

        for my $l ($min_len..$length) {
            my $s = substr( $string, 0, $l );
            $count{ $s } += 1;
        }

        count( substr( $string, 1 ) );
    };
    count($str);
    \%count;
}

sub ikegami {
    my ($str, $min_len, $min_rep)=@_;
    local our %counts;
    use re 'eval';
    $str =~ /
      (.{$min_len,})   # or (.+)
      (?(?{ !$counts{$1} })(?=.*\1))
      (?{ ++$counts{$1} })
      (?!)
      /x;
    for (keys %counts) {
        delete $counts{$_} if $counts{$_} < $min_rep;
    }
    \%counts;
}

sub lodin {
    my ($str, $min_len, $min_rep)=@_;
    local our %count;
    use re 'eval';
    $str =~ /
      (.{$min_len,})
      (?(?{ $count{$1} })
      (?!)
      )
      (?>
      .*?
      \1
      ){@{[ $min_rep - 2 ]}}
      .*
      \1
      (?{ ($count{$1} ||= $min_rep-1)++ })
      (?!)
      /x;
    \%count;
}

{
    my %cache;
    sub _reference { $cache{"@_"} ||= blazar @_ }
}

for my $s ( map {$str x $_} 1,3,42) {
    for my $len (1..2) {
        for my $rep (2..3) {
            my $strlen=length $s;
            print "\nstring length=$strlen, len=$len, rep=$rep\n\n";
            cmpthese +($strlen < 100 ? 10_000 : -60) => {
                kramba  => sub { kramba($s,$len,$rep) },
                ikegami => sub { ikegami($s,$len,$rep) },
                lodin   => sub { lodin($s,$len,$rep) },
            };
        }
    }
}

print "\n";

for my $len (1..3) {
    for my $rep (2..3) {
        my $tag="len=$len, rep=$rep";
        is_deeply kramba($str,$len,$rep), _reference($str,$len,$rep), 
+"kramba $tag";
        is_deeply ikegami($str,$len,$rep), _reference($str,$len,$rep),
+ "ikegami $tag";
        is_deeply lodin($str,$len,$rep), _reference($str,$len,$rep), "
+lodin $tag";
    }
}

__END__
[download]

And here's the output:

string length=17, len=1, rep=2

          Rate  kramba ikegami   lodin
kramba  1860/s      --    -59%    -60%
ikegami 4570/s    146%      --     -1%
lodin   4636/s    149%      1%      --

string length=17, len=1, rep=3

          Rate  kramba   lodin ikegami
kramba  1866/s      --    -58%    -60%
lodin   4413/s    137%      --     -5%
ikegami 4638/s    149%      5%      --

string length=17, len=2, rep=2

          Rate  kramba   lodin ikegami
kramba  2000/s      --    -63%    -64%
lodin   5470/s    174%      --     -3%
ikegami 5618/s    181%      3%      --

string length=17, len=2, rep=3

          Rate  kramba   lodin ikegami
kramba  1905/s      --    -59%    -66%
lodin   4604/s    142%      --    -18%
ikegami 5612/s    195%     22%      --

string length=51, len=1, rep=2

         Rate  kramba   lodin ikegami
kramba  212/s      --     -0%    -18%
lodin   212/s      0%      --    -18%
ikegami 259/s     22%     22%      --

string length=51, len=1, rep=3

         Rate  kramba ikegami   lodin
kramba  201/s      --     -5%     -6%
ikegami 212/s      6%      --     -1%
lodin   215/s      7%      1%      --

string length=51, len=2, rep=2

         Rate  kramba   lodin ikegami
kramba  202/s      --     -5%     -7%
lodin   213/s      5%      --     -2%
ikegami 217/s      7%      2%      --

string length=51, len=2, rep=3

         Rate  kramba ikegami   lodin
kramba  200/s      --     -5%    -10%
ikegami 210/s      5%      --     -5%
lodin   221/s     11%      5%      --

string length=714, len=1, rep=2

        s/iter   lodin ikegami  kramba
lodin     2.50      --     -5%    -50%
ikegami   2.37      6%      --    -48%
kramba    1.24    102%     91%      --

string length=714, len=1, rep=3

        s/iter   lodin ikegami  kramba
lodin     3.48      --    -30%    -64%
ikegami   2.45     42%      --    -49%
kramba    1.25    180%     96%      --

string length=714, len=2, rep=2

        s/iter   lodin ikegami  kramba
lodin     2.50      --     -1%    -51%
ikegami   2.49      1%      --    -50%
kramba    1.24    103%    101%      --

string length=714, len=2, rep=3

        s/iter   lodin ikegami  kramba
lodin     3.40      --    -28%    -63%
ikegami   2.44     39%      --    -49%
kramba    1.25    173%     96%      --

ok 1 - kramba len=1, rep=2
ok 2 - ikegami len=1, rep=2
ok 3 - lodin len=1, rep=2
ok 4 - kramba len=1, rep=3
ok 5 - ikegami len=1, rep=3
ok 6 - lodin len=1, rep=3
ok 7 - kramba len=2, rep=2
ok 8 - ikegami len=2, rep=2
ok 9 - lodin len=2, rep=2
ok 10 - kramba len=2, rep=3
ok 11 - ikegami len=2, rep=3
ok 12 - lodin len=2, rep=3
ok 13 - kramba len=3, rep=2
ok 14 - ikegami len=3, rep=2
ok 15 - lodin len=3, rep=2
ok 16 - kramba len=3, rep=3
ok 17 - ikegami len=3, rep=3
ok 18 - lodin len=3, rep=3
1..18
[download]

Krambambuli seems to win in the long string case. Anyway, analysis? Refinements, comments, additions?

Update: well done, lodin++.

In reply to Re^6: how to count the number of repeats in a string (really!) by blazar
in thread how to count the number of repeats in a string (really!) by blazar

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.