PerlMonks  

Re: substrings that consist of repeating characters (updated x3)

by AnomalousMonk (Archbishop)
on Sep 27, 2020 at 18:23 UTC ( [id://11122269]=note )


in reply to substrings that consist of repeating characters

Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:19:34
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

my @repeats = do {
    my $p;
    grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
};

dd \@repeats;
__END__
["AAAAA", "CCC", "GGGGGGG"]

Update 1: But you also want lengths:

Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:20:42
C:\@Work\Perl\monks
>perl
use strict;
use warnings;

use Data::Dump qw(dd);

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

my @repeats_and_lengths = do {
    my $p;
    map [ $_, length ], grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
};

dd \@repeats_and_lengths;
__END__
[["AAAAA", 5], ["CCC", 3], ["GGGGGGG", 7]]
You already know how to sort this. :)
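For completeness, a minimal sketch of one way the sort could go. The ordering (longest run first, ties broken alphabetically) and the name @sorted are my assumptions, not something the question specified. The extraction is the same as above: in list context the match returns (run, character) pairs, and the toggling grep keeps only the runs.

```perl
use strict;
use warnings;

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

# The /g match in list context returns ($1, $2, $1, $2, ...);
# the toggling grep keeps every other element, i.e. only the runs.
my @repeats_and_lengths = do {
    my $p;
    map [ $_, length ], grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
};

# Assumed ordering: longest run first, ties broken alphabetically.
my @sorted = sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
    @repeats_and_lengths;

print "$_->[0] $_->[1]\n" for @sorted;
```

With the sample string this prints GGGGGGG 7, then AAAAA 5, then CCC 3.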

Update 2:

"... there are statements in the while loop that look doubtful ..."
I don't see anything objectionable. (I first wrote "other than the useless /g modifier on the /.../g regex", but oops... it's not useless!) There are usually several ways to do anything, and which is "best" is often a question of taste, unless you're Benchmark-ing.
"... the idea of using an array to store the substring along with its length might not be good."
Again, I see nothing to gripe about. It's a matter of taste and the best impedance match to the rest of the code.
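To show why the /g modifier earns its keep in a while loop, here is a rough sketch. The loop from the original question isn't reproduced in this reply, so the loop shape and the @found array are assumptions for illustration only.

```perl
use strict;
use warnings;

my $string = 'ACGTAAAAATGCCCATGGGGGGG';

my @found;
# In scalar context, /g makes each match resume where the previous one
# ended (the position is tracked by pos($string)). Without /g, every
# iteration would restart at offset 0, match 'AAAAA' again, and the
# loop would never terminate.
while ($string =~ m{ ((.) \2+) }xmsg) {
    # $-[1] is the start offset of capture group 1 in the string.
    push @found, [ $1, length($1), $-[1] ];
}

printf "%-8s len %d at offset %d\n", @$_ for @found;
```

A side benefit of the loop form is that per-match details such as the start offset are available, which the list-context one-liner doesn't give you directly.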

Update 3: Oh, and one more thing... If you're doing a buncha matching operations on a buncha long sequences, it might be useful to add a validation step for each input sequence to be sure it consists only of [ATCG] characters before any further matching operations are done. Once a sequence has passed validation, you can match with . (dot) and know that it can only be matching a valid base character. Matching with dot might save significant time over many matches compared with matching against a character class each time, but this can only be determined for sure by benchmarking. (I'd be inclined to add a validation step anyway just to be sure your data really is what you think it is.)
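A minimal sketch of such a validation step. The helper name is_valid_dna is hypothetical, and I'm assuming uppercase-only input and that an empty sequence should be rejected; adjust to whatever your data really looks like.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the sequence is non-empty and made up
# only of the uppercase bases A, C, G, T; returns 0 otherwise.
sub is_valid_dna {
    my ($seq) = @_;
    return $seq =~ m{ \A [ACGT]+ \z }xms ? 1 : 0;
}

for my $seq ('ACGTAAAAATGCCCATGGGGGGG', 'ACGTNNNNACGT', '') {
    if (is_valid_dna($seq)) {
        # Safe to use . (dot) here: every character is a known base.
        my @repeats = do {
            my $p;
            grep { $p = !$p } $seq =~ m{ ((.) \2+) }xmsg;
        };
        print "ok: $seq has runs @repeats\n";
    }
    else {
        printf "invalid: '%s'\n", $seq;
    }
}
```

Only the first sequence passes; the one containing N and the empty string are both reported as invalid and never reach the repeat-matching regex.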


Give a man a fish:  <%-{-{-{-<
