in reply to how to count the number of repeats in a string (really!)

This is a perfect task for the regex engine.

local our %count; $str =~ / (.+) # or .{N,} where N is minimum length. (?(?{ $count{$1} }) (?!) ) .* \1 (?{ ($count{$1} ||= 1)++ }) (?!) /x;
A more generalized version where you can specify the minimum substring length and minimum number of occurances is
my $min_len = 2; # Substring is at least two chars long. my $min_count = 3; # Substring occures at least three times. local our %count; use re 'eval'; $str =~ / (.{$min_len,}) (?(?{ $count{$1} }) (?!) ) (?> .*? \1 ){@{[ $min_count - 2 ]}} .* \1 (?{ ($count{$1} ||= $min_count-1)++ }) (?!) /x;

lodin

Update:

While writing this, ikegami posted a very similar-looking reply. While they look very much alike they work quite differently. ikegami's work by requiring that each match is repeated further into the string, and then goes on to count all those successive matches kind of like a global match. Mine does the counting right away, and then forces the engine to not count those again. So they work rather opposite of each other.

I did a shallow benchmark. It seems that mine is a slight favourite (5-10%) in many, but not all, situations. I also get the impression that ikegami's scales slightly better, see Re^5: how to count the number of repeats in a string (really!).

Update 2:

Added the comment in the regex.

Update 3:

Added the generalized version.

Replies are listed 'Best First'.
Re^2: how to count the number of repeats in a string (really!) [regexp solution]
by ikegami (Patriarch) on Nov 16, 2007 at 20:41 UTC
    You made the same mistake I did. It fails to find two 'aba' in 'ababa'.

      Mistake and mistake. That's not how I interpreted the question. The problem statement is a bit vague on this. For practical purposes I don't think 'aaa' should report 2 x 'aa' as well.

      If this is the task however, it's better to just exhaustively match everything of a minimum length. The problem as I (and you?) interpreted it is actually more interesting, I think.

      lodin