Anyway, analysis? Refinements, comments, additions?

These routines are still not equivalent. They all give different results for e.g. 'aaaaa'. You're not using ikegami's version from Re^6: how to count the number of repeats in a string (really!). With that version, ikegami's and krambambuli's versions are equivalent. A simple modification to the regex in your routine makes that equivalent too. I also removed the tail recursion from kramba, which improves the performance with about 50 %.

The return value from these routines should be interpreted as the answer to "how many unique substrings are there in the string, and what's their frequency". This is a different problem from "how many times is a substring sequentially repeated (possibly with other substrings inbetween) in the string", which is the problem most of us solved initially.

Here's the code and the output:

use strict; use warnings; use Test::More 'no_plan'; use Benchmark qw/:all :hireswallclock/; my $str = 'aabcdabcabcecdecdaaaa'; sub blazar { my ($str, $min_len, $min_rep)=@_; my $l=length($str); my %count; for my $off (0..$l-1) { for my $len ($min_len .. $l-$off) { my $s = quotemeta substr $str, $off, $len; $count{ $s } ||= () = $str =~ /(?=$s)./gs; } $count{$_} < $min_rep and delete $count{$_} for keys %count; } \%count; } sub kramba { my ($str, $min_len, $min_rep)=@_; my %count; for my $c (0 .. length($str) - $min_len) { my $string = substr($str, $c); for my $l ($min_len .. length $string) { $count{substr($string, 0, $l)}++; } } for (keys %count) { delete $count{$_} if $count{$_} < $min_rep; } return \%count; } sub ikegami { my ($str, $min_len, $min_rep)=@_; local our %counts; use re 'eval'; $str =~ / (.{$min_len,}) (?{ ++$counts{$1} }) (?!) /x; for (keys %counts) { delete $counts{$_} if $counts{$_} < $min_rep; } \%counts; } my %subs = ( ikegami => \&ikegami, blazar => \&blazar, kramba => \&kramba, ); for my $len (1..3) { for my $rep (2..3) { my $tag="len=$len, rep=$rep"; my $ref = ikegami($str, $len, $rep); for my $name (keys %subs) { my $code = $subs{$name}; is_deeply($code->($str, $len, $rep), $ref, "$name - $tag") +; } } } print "\n"; for my $s ( map {$str x $_} 1,3,42) { for my $len (1..2) { for my $rep (2..3) { printf "length=%d, min_len=%d, min_rep=%d\n", length $s, $len, $rep ; cmpthese(-60, { map { my $c = $subs{$_}; $_ => sub { $c->($s, $len, $rep) } } keys %subs }); print "\n"; } } } __END__ ok 1 - blazar - len=1, rep=2 ok 2 - ikegami - len=1, rep=2 ok 3 - kramba - len=1, rep=2 ok 4 - blazar - len=1, rep=3 ok 5 - ikegami - len=1, rep=3 ok 6 - kramba - len=1, rep=3 ok 7 - blazar - len=2, rep=2 ok 8 - ikegami - len=2, rep=2 ok 9 - kramba - len=2, rep=2 ok 10 - blazar - len=2, rep=3 ok 11 - ikegami - len=2, rep=3 ok 12 - kramba - len=2, rep=3 ok 13 - blazar - len=3, rep=2 ok 14 - ikegami - len=3, rep=2 ok 15 - kramba - len=3, rep=2 ok 16 - blazar - len=3, rep=3 ok 17 - ikegami - len=3, rep=3 ok 18 - kramba - len=3, rep=3 length=21, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 364/s -- -84% -86% ikegami 2329/s 539% -- -13% kramba 2674/s 634% 15% -- length=21, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 355/s -- -85% -87% ikegami 2316/s 553% -- -16% kramba 2744/s 674% 18% -- length=21, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 381/s -- -85% -87% ikegami 2462/s 546% -- -13% kramba 2826/s 642% 15% -- length=21, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 381/s -- -84% -86% ikegami 2424/s 537% -- -13% kramba 2802/s 636% 16% -- length=63, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 28.8/s -- -91% -94% ikegami 328/s 1039% -- -26% kramba 444/s 1441% 35% -- length=63, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 35.1/s -- -90% -92% ikegami 340/s 870% -- -21% kramba 432/s 1130% 27% -- length=63, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 32.0/s -- -91% -93% ikegami 355/s 1009% -- -21% kramba 452/s 1312% 27% -- length=63, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 35.0/s -- -90% -92% ikegami 335/s 855% -- -21% kramba 423/s 1106% 26% -- length=882, min_len=1, min_rep=2 Rate blazar ikegami kramba blazar 6.46e-002/s -- -95% -95% ikegami 1.20/s 1754% -- -16% kramba 1.42/s 2104% 19% -- length=882, min_len=1, min_rep=3 Rate blazar ikegami kramba blazar 6.16e-002/s -- -95% -96% ikegami 1.17/s 1804% -- -17% kramba 1.41/s 2189% 20% -- length=882, min_len=2, min_rep=2 Rate blazar ikegami kramba blazar 6.14e-002/s -- -95% -96% ikegami 1.18/s 1829% -- -17% kramba 1.42/s 2213% 20% -- length=882, min_len=2, min_rep=3 Rate blazar ikegami kramba blazar 6.27e-002/s -- -95% -96% ikegami 1.18/s 1779% -- -18% kramba 1.44/s 2197% 22% -- 1..18

lodin


In reply to Re^7: how to count the number of repeats in a string (really!) by lodin
in thread how to count the number of repeats in a string (really!) by blazar

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.