Re: substrings that consist of repeating characters (updated x3)

Win8 Strawberry 5.8.9.5 (32)  Sun 09/27/2020 14:19:34
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

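my @repeats = do {
    my $p;
    # In list context, m//g returns ($1, $2) for every match; the toggling
    # $p keeps only the odd-positioned items, i.e. each full run from $1.
    grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
    };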

dd \@repeats;

__END__
["AAAAA", "CCC", "GGGGGGG"]

Update 1: But you also want lengths:

Win8 Strawberry 5.8.9.5 (32)  Sun 09/27/2020 14:20:42
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

my @repeats_and_lengths = do {
    my $p;
    map  [ $_, length ],                               # pair each run with its length
    grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;  # keep $1, drop $2
    };

dd \@repeats_and_lengths;

__END__
[["AAAAA", 5], ["CCC", 3], ["GGGGGGG", 7]]

You already know how to sort this. :)
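In case it helps anyone else reading along, here is a minimal sketch of one way to sort those pairs, longest run first. The sub names repeats_and_lengths and sort_by_length_desc are made up for this example, and the alphabetical tie-break is just an assumption about what you might want.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the code above): returns [substring, length]
# pairs for every run of two or more identical characters.
sub repeats_and_lengths {
    my ($string) = @_;
    my $p;
    return
        map  { [ $_, length $_ ] }
        grep { $p = !$p }                  # keep $1 of each ($1, $2) pair
        $string =~ m{ ((.) \2+) }xmsg;
}

# Hypothetical helper: longest run first; ties broken alphabetically
# by substring (an assumed tie-break, change to taste).
sub sort_by_length_desc {
    return sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] } @_;
}

my @sorted = sort_by_length_desc(repeats_and_lengths('ACGTAAAAATGCCCATGGGGGGG'));
printf "%-8s %d\n", @$_ for @sorted;
```

For the sample string this lists GGGGGGG first, then AAAAA, then CCC.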

Update 2:

... there are statements in the while loop that look doubtful ...

~~Other than the useless /g modifier on the /.../g regex,~~ (Oops... not useless!) I don't see anything objectionable. There are usually several ways to do anything, and which is "best" is often a question of taste, unless you're Benchmark-ing.
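On why /g is not useless in a while loop: the loop in question isn't reproduced here, so the following is only a sketch that assumes the usual while ($string =~ /.../g) shape. In scalar context, /g makes each iteration resume where the previous match ended; without it, the match would restart at the beginning of the string every time and the loop would never finish once anything matched. The sub name repeats_via_while is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [substring, length] pairs with a while loop.
sub repeats_via_while {
    my ($string) = @_;
    my @found;
    # In scalar context, /g resumes each match at pos($string), so the
    # loop walks through the string and ends when no further run is found.
    while ($string =~ m{ ((.) \2+) }xmsg) {
        push @found, [ $1, length $1 ];
    }
    return @found;
}

print "$_->[0] ($_->[1])\n" for repeats_via_while('ACGTAAAAATGCCCATGGGGGGG');
```

This finds the same three runs as the grep version above, in the order they occur in the string.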

... the idea of using an array to store the substring along with its length might not be good.

Again, I see nothing to gripe about. It's a matter of taste and the best impedance match to the rest of the code.

Update 3: Oh, and one more thing... If you're doing a buncha matching operations on a buncha long sequences, it might be useful to add a validation step for each input sequence to be sure it consists only of [ATCG] characters before any further matching operations are done. This allows you to match with . (dot) and know that you can only be matching a valid base character. This might save significant time over many matches, but this can only be determined for sure by benchmarking. (I'd be inclined to add a validation step anyway just to be sure your data really is what you think it is.)
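A minimal sketch of such a validation step, assuming that only uppercase A, C, G and T count as valid (so lowercase bases and ambiguity codes like N are rejected) and that an empty sequence is also invalid. The sub name is_valid_dna and the second sample sequence are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the sequence is non-empty and contains
# only the uppercase bases A, C, G, T (assumed definition of "valid").
sub is_valid_dna {
    my ($seq) = @_;
    return $seq =~ m{ \A [ACGT]+ \z }xms ? 1 : 0;
}

for my $seq (qw(ACGTAAAAATGCCCATGGGGGGG ACGTNNNACGT)) {
    if (is_valid_dna($seq)) {
        print "$seq: ok\n";
    }
    else {
        warn "$seq: contains characters other than A, C, G, T; skipping\n";
    }
}
```

Using \z rather than $ means a trailing newline also counts as invalid, so chomp your input first if it comes from a file.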

Give a man a fish: <%-{-{-{-<
