in reply to substrings that consist of repeating characters
Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:19:34 C:\@Work\Perl\monks >perl use strict; use warnings; use Data::Dump qw(dd); my $string = 'ACGTAAAAATGCCCATGGGGGGG'; my @repeats = do { my $p; grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg; }; dd \@repeats; __END__ ["AAAAA", "CCC", "GGGGGGG"]
Update 1: But you also want lengths:
You already know how to sort this. :)Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:20:42 C:\@Work\Perl\monks >perl use strict; use warnings; use Data::Dump qw(dd); my $string = 'ACGTAAAAATGCCCATGGGGGGG'; my @repeats_and_lengths = do { my $p; map [ $_, length ], grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg; }; dd \@repeats_and_lengths; __END__ [["AAAAA", 5], ["CCC", 3], ["GGGGGGG", 7]]
Update 2:
... there are statements in the while loop that look doubtful ...
... the idea of using an array to store the substring along with its length might not be good.Again, I see nothing to gripe about. It's a matter of taste and the best impedance match to the rest of the code.
Update 3: Oh, and one more thing... If you're doing a buncha matching operations on a buncha long sequences, it might be useful to add a validation step for each input sequence to be sure it consists only in [ATCG] characters before any further matching operations are done. This allows you to match with . (dot) and know that you can only be matching a valid base character. This might save significant time over many matches, but this can only be determined for sure by benchmarking. (I'd be inclined to add a validation step anyway just to be sure your data really is what you think it is.)
Give a man a fish: <%-{-{-{-<
|
|---|