in reply to substrings that consist of repeating characters
Given that biological data can be huge, Perl's built-in string-handling functions can often be far more efficient than regexes. The Benchmark module can help when choosing between candidate solutions.
The following code still uses regexes but only minimally:
    #!/usr/bin/env perl

    use 5.014;
    use warnings;

    my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
    my $min_repeat = 2;

    for my $base (qw{A C G T}) {
        say "$base: ", get_longest_length($string, $base, $min_repeat);
    }

    sub get_longest_length {
        my ($str, $base, $min) = @_;

        # Character class of every base except the one we want, e.g. [CGT]+ for A.
        my $re = '[' . 'ACGT' =~ s/$base//r . ']+';

        # Split on the other bases, keep runs of at least $min, return the longest.
        return (
            sort { length $b <=> length $a }
            grep length $_ >= $min,
            split /$re/, $str
        )[0];
    }
Output:
    A: AAA
    C: CCCCCC
    G: GGGG
    T: TTT
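For comparison, here is a rough sketch of a regex-free variant together with a Benchmark harness. The sub name `longest_run_index` and the `--bench` switch are my own inventions for illustration, not part of the code above. The idea is to keep asking `index` for a longer and longer run of the base until it fails. I have not measured it against real sequence data, so treat any speed difference as something to check on your own input.

```perl
#!/usr/bin/env perl

use 5.014;
use warnings;
use Benchmark qw{cmpthese};

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
my $min_repeat = 2;

# Hypothetical regex-free variant: grow the run length until index() fails.
sub longest_run_index {
    my ($str, $base, $min) = @_;
    my $n = $min;
    $n++ while index($str, $base x $n) >= 0;
    # $n is now the first length NOT found, so the longest run is $n - 1.
    # Returns an empty string when there is no run of at least $min.
    return $n > $min ? $base x ($n - 1) : '';
}

# The split-based sub from the code above, repeated here for the comparison.
sub get_longest_length {
    my ($str, $base, $min) = @_;
    my $re = '[' . 'ACGT' =~ s/$base//r . ']+';
    return (
        sort { length $b <=> length $a }
        grep length $_ >= $min,
        split /$re/, $str
    )[0];
}

say "$_: ", longest_run_index($string, $_, $min_repeat) for qw{A C G T};

# Pass --bench to time both versions for about one CPU second each.
if (@ARGV && $ARGV[0] eq '--bench') {
    cmpthese(-1, {
        'index_scan'  => sub { longest_run_index($string, $_, $min_repeat) for qw{A C G T} },
        'split_regex' => sub { get_longest_length($string, $_, $min_repeat) for qw{A C G T} },
    });
}
```

Run without arguments, this should print the same four lines as the output above. One difference in behaviour: when no run reaches the minimum length, this version returns an empty string, whereas the split-based sub returns undef.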
Notes:
— Ken