Inspired by today's how to count the no of repeats in a string, I came up with an admittedly artificial but possibly very interesting problem: that of counting for a string the repetition count of each substring that is actually repeated. For example, given the string

my $str = 'aabcdabcabcecdecd';

I see that -limiting ourselves to substrings of length 2 or more- we have the following counts:

(abc => 3, bc => 3, cd => 3, ecd => 2)

One possible way to get those counts is as follows -I'm casting the thing in the form of a sub-:

sub count1 { local $_=shift; my %count; for my $pos (0..(length)-1) { pos=$pos; @count{ /(.{2,}).*\1/g } = (); } for my $k (keys %count) { $count{$k} =()= /$k/g; } \%count; }

In fact, print Dumper count1 $str; gives me:

$VAR1 = { 'cd' => 3, 'ecd' => 2, 'abc' => 3, 'bc' => 3 };

To my unpleasant surprise, if I change .{2,} to .+ I do not get all substrings (of length 1):

$VAR1 = { 'cd' => 3, 'c' => 5, 'ecd' => 2, 'a' => 4, 'abc' => 3, 'd' => 3, 'bc' => 3 };

And I can't understand why...

I personally believe that map based solutions are often very cute and so I rolled my own too:

sub count2 { my $s=shift; my %saw; my %count = map { map { $saw{$_}++ ? () : $_ => scalar(()= $s =~ /$_/g); } do { pos=$_; $s =~ /(.{2,}).*\1/g }; } 0..length($s)-1; \%count; }

But the latter

  1. is clumsier than the non-map version and;
  2. what's worst, is not correct at all:
    $VAR1 = { 'ecd' => 2, 'abc' => 3, '3' => 2 };

And I can't understand why, either.

What's your take on the problem?


In reply to how to count the number of repeats in a string (really!) by blazar

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.