Re: substrings that consist of repeating characters

TMTOWTDI

Given that biological data can be huge, Perl's built-in string-handling functions can often be far more efficient than regexes. The Benchmark module can help when choosing between solutions.
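As a rough sketch of how Benchmark might be used for this kind of comparison: the two candidate subs below (longest_by_regex and longest_by_index) and the test data are made up for illustration and are not the solution posted further down. Timings will vary with your data and Perl version, so treat any result as indicative only.

```perl
#!/usr/bin/env perl

use 5.014;
use warnings;

use Benchmark qw{cmpthese};

# Hypothetical test data: a longer sequence built by repeating a short one.
my $dna = 'AAATTTAGTTCTTAAGGCTGACATCGG' x 1_000;

# Candidate 1 (illustrative): regex only; visit every run of $base
# and keep the longest one seen.
sub longest_by_regex {
    my ($str, $base) = @_;
    my $longest = '';
    while ($str =~ /($base+)/g) {
        $longest = $1 if length($1) > length($longest);
    }
    return $longest;
}

# Candidate 2 (illustrative): builtin index(); try ever longer runs
# until one is no longer found in the string.
sub longest_by_index {
    my ($str, $base) = @_;
    my $len = 0;
    $len++ while index($str, $base x ($len + 1)) > -1;
    return $base x $len;
}

# Check that both candidates agree before timing them.
for my $base (qw{A C G T}) {
    die "Mismatch for $base"
        unless longest_by_regex($dna, $base) eq longest_by_index($dna, $base);
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    regex => sub { longest_by_regex($dna, $_) for qw{A C G T} },
    index => sub { longest_by_index($dna, $_) for qw{A C G T} },
});
```

Checking that the candidates return the same answer before timing them avoids benchmarking code that is fast but wrong.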

The following code still uses regexes but only minimally:

#!/usr/bin/env perl

use 5.014;
use warnings;

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
my $min_repeat = 2;

for my $base (qw{A C G T}) {
    say "$base: ", get_longest_length($string, $base, $min_repeat);
}

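# Returns the longest run of $base in $str that is at least $min long
# (the substring itself, not its length), or undef if there is none.
sub get_longest_length {
    my ($str, $base, $min) = @_;

    # Character class of the other three bases; splitting on it
    # leaves only the runs of $base.
    my $re = '[' . 'ACGT' =~ s/$base//r . ']+';
    return (
        sort { length $b <=> length $a }
        grep length $_ >= $min, split /$re/, $str
    )[0];
}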

Output:

A: AAA
C: CCCCCC
G: GGGG
T: TTT

Notes:

I've specified v5.14 to use the 'r' modifier. See "perl5140delta: Non-destructive substitution".
You can use index to find the number and position(s) of maximum-length substring(s).
There are a number of optimisations that could be applied, but that will largely depend on your intended usage of this code.
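The note about index could be sketched along these lines. The helper name get_positions is invented for this example, and it assumes you pass in a maximum-length run (such as the 'TTT' reported in the output above), so that occurrences cannot overlap.

```perl
#!/usr/bin/env perl

use 5.014;
use warnings;

# Hypothetical helper: returns every zero-based offset at which $run
# occurs in $str, found with repeated calls to the builtin index().
# Assumes $run is a maximum-length run, so matches cannot overlap.
sub get_positions {
    my ($str, $run) = @_;

    my @positions;
    my $pos = index($str, $run);
    while ($pos > -1) {
        push @positions, $pos;
        # Resume the search just past the end of this match.
        $pos = index($str, $run, $pos + length($run));
    }
    return @positions;
}

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';

# 'TTT' is the longest run of T found by the code above.
my @positions = get_positions($string, 'TTT');
say 'TTT: count=', scalar @positions, ' positions=', "@positions";
```

The size of the returned list gives the number of maximum-length runs, and each element is an offset you could pass to substr.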

— Ken
