Re: Faster regex to split a string into runs of similar characters?


Depending on how you want to use the information, this may help:

use Benchmark qw( cmpthese );

$s = join'', map{ chr( 65+rand(26) ) x rand( 100 ) } 1 .. 1000;;

push @first, $1 while $s =~ /((.)\2*)/gs;
# XORing the string with a shifted copy of itself, so that you have
# a series of 0s for identical characters
$s2 = " $s" ^ $s;
push @second, $1 while $s2 =~ /(.\o{0}*)/gs;

$\ = $/ x 2;
print pack "(A4)*", map length, @first;
print pack "(A4)*", map length, @second;

cmpthese -1,{
    a=>q[ 
        1 while $s =~ m[((?=(.))\2+)]g;
    ],
    b=>q[
        1 while $s =~ m[((.)\2*)]sg;
    ] ,
    c=>q[$s3 = " $s" ^ $s;; 1 while $s3 =~ /(.\o{0}*)/gs],
};;

This first prints the lengths of the runs found by the two methods (I have removed most of the output, which doesn't add any more information):

55  97  65  7   87  60  53  98  2   71  35  68  67  58  12  19  17  22  5   28  63  96  30  18  32  6   37  27  47  68  79  97  2   9   60  75  87  31  15  82  62  78  33  69  10  35  4   82  61  33  63  82  96  68  140
88  59  67  87  78  98  14  3   6   52  59  74  86  79  49  44  28  76  25  83  99  66  42  67  73  3   46

55  97  65  7   87  60  53  98  2   71  35  68  67  58  12  19  17  22  5   28  63  96  30  18  32  6   37  27  47  68  79  97  2   9   60  75  87  31  15  82  62  78  33  69  10  35  4   82  61  33  63  82  96  68  140
88  59  67  87  78  98  14  3   6   52  59  74  86  79  49  44  28  76  25  83  99  66  42  67  73  3   46  1

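So the second method does give the correct lengths, plus one extra trailing entry of length 1: the shifted copy is one character longer than the original, so the last byte of the XORed string is never a NUL.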
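To see where that extra entry comes from, here is a small sketch on a short made-up string (the variable names and the sample string are mine, not from the benchmark above). It assumes single-byte characters and plain string operands, so that `^` does a bitwise string XOR.

```perl
use strict;
use warnings;

my $s = 'aabbbc';

# XOR with a copy shifted right by one character. The shorter operand is
# padded with NUL bytes, so the result is one byte longer than $s.
my $x = " $s" ^ $s;

# Make the result readable: NUL bytes shown as '0', anything else as 'x'.
( my $shape = $x ) =~ s/[^\x00]/x/g;
$shape =~ tr/\x00/0/;
print "$shape\n";          # x0x00xx

# Run lengths from the backreference regex and from the XORed string.
my ( @by_backref, @by_xor );
push @by_backref, length $1 while $s =~ /((.)\2*)/gs;
push @by_xor,     length $1 while $x =~ /(.\x00*)/gs;

print "@by_backref\n";     # 2 3 1
print "@by_xor\n";         # 2 3 1 1
```

The final `x` in the shape is the last character of `$s` XORed against the NUL padding, which is why the second list always ends with a spurious 1.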

And in the benchmark the XOR version is much faster, at roughly five times the rate of the two backreference versions:

    Rate    a    b    c
a  445/s   --  -4% -80%
b  465/s   5%   -- -79%
c 2228/s 401% 379%   --

It does not directly provide all the information the other methods do: the matches contain XORed bytes rather than the original characters, so you still have to look at the original string to know what each run actually contains. It might still be useful, depending on what you actually need.
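If you do need the substrings themselves, one way is to use the lengths as offsets into the original string. This is only a sketch under the same assumptions as the code above (single-byte characters, string XOR); the names `@lengths` and `@runs` are made up for the example, and I have not benchmarked this extra step.

```perl
use strict;
use warnings;

my $s = 'aaabccdddd';

# Positions where a character repeats its predecessor become NUL bytes.
my $x = " $s" ^ $s;

# Each match is the first byte of a run plus the NULs that follow it.
my @lengths;
push @lengths, length $1 while $x =~ /(.\x00*)/gs;

# Drop the spurious final entry caused by the one-character shift.
pop @lengths;

# Walk the original string, cutting out one run per length.
my ( $pos, @runs ) = ( 0 );
for my $len ( @lengths ) {
    push @runs, substr( $s, $pos, $len );
    $pos += $len;
}

print "@runs\n";           # aaa b cc dddd
```

Whether this still beats the backreference regex once the `substr` pass is added will depend on your data, so it is worth timing on real input.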
