Re: substrings that consist of repeating characters
by GrandFather (Saint) on Sep 27, 2020 at 22:26 UTC
In Perl, length is cheap, so calculate it when you need it. The following code is a little more Perlish but, other than using a threshold to drop short strings, is similar to your code. For variety's sake the regex has been changed slightly to be a little easier to grok:
use strict;
use warnings;
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my @runs;
my $threshold = 3;
length $1 >= $threshold && (push @runs, $1) while $string =~ /(A+|C+|G+|T+)/g;
@runs = sort {length($b) <=> length($a)} @runs;
print "@runs\n";
Prints:
CCCCCC GGGG AAA TTT TTT TTT
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
|
DB<56> $_ = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
DB<57> $threshold = 3;
DB<58> x sort { length($b) <=> length($a) } grep { length >= $threshold } /(A+|C+|G+|T+)/g
0 'CCCCCC'
1 'GGGG'
2 'AAA'
3 'TTT'
4 'TTT'
5 'TTT'
DB<59>
EDIT
and for the original problem
DB<59> x sort { length($b) <=> length($a) } /(AA+|CC+|GG+|TT+)/g
0 'CCCCCC'
1 'GGGG'
2 'AAA'
3 'TTT'
4 'TTT'
5 'TTT'
6 'TT'
7 'TT'
8 'AA'
9 'GG'
10 'GG'
11 'TT'
12 'AA'
13 'TT'
14 'TT'
DB<60>
Re: substrings that consist of repeating characters
by kcott (Archbishop) on Sep 28, 2020 at 02:42 UTC
TMTOWTDI
Given biological data can be huge, using Perl's builtin string-handling functions
can often be far more efficient than using regexes.
Using Benchmark can help when choosing a solution.
The following code still uses regexes but only minimally:
#!/usr/bin/env perl
use 5.014;
use warnings;
my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
my $min_repeat = 2;
for my $base (qw{A C G T}) {
say "$base: ", get_longest_length($string, $base, $min_repeat);
}
sub get_longest_length {
my ($str, $base, $min) = @_;
my $re = '[' . 'ACGT' =~ s/$base//r . ']+';
return (
sort { length $b <=> length $a }
grep length $_ >= $min, split /$re/, $str
)[0];
}
Output:
A: AAA
C: CCCCCC
G: GGGG
T: TTT
Notes:
- I've specified v5.14 to use the 'r' modifier. See "perl5140delta: Non-destructive substitution".
- You can use index to find the number and position(s) of maximum-length substring(s).
- There are a number of optimisations that could be applied, but that will largely depend on your intended usage of this code.
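The note about index could be sketched roughly like this. It assumes the longest run (here 'TTT') is already known, and find_positions is a hypothetical helper name, not part of the code above:

```perl
use strict;
use warnings;

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';

# find_positions is a hypothetical helper: it returns every offset at which
# $needle starts in $str, scanning with index() instead of a regex.
sub find_positions {
    my ($str, $needle) = @_;
    my @offsets;
    my $pos = 0;
    while (($pos = index($str, $needle, $pos)) != -1) {
        push @offsets, $pos;
        $pos += length $needle;    # continue after this occurrence
    }
    return @offsets;
}

my @ttt = find_positions($string, 'TTT');
print scalar(@ttt), " occurrence(s) of TTT at offsets: @ttt\n";
```

This only gives meaningful counts for a maximum-length run; a shorter needle such as 'TT' would also be found inside longer runs.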
Re: substrings that consist of repeating characters
by tybalt89 (Monsignor) on Sep 27, 2020 at 20:23 UTC
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11122267
use warnings;
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my @substrings;
push @{ $substrings[length $1] }, $1 while $string =~ /(([ACTG])\2+)/g;
my @sorted = map @{ $_ // [] }, reverse @substrings;
use Data::Dump 'dd'; dd \@sorted;
Outputs:
[
"CCCCCC",
"GGGG",
"AAA",
"TTT",
"TTT",
"TTT",
"TT",
"TT",
"AA",
"GG",
"GG",
"TT",
"AA",
"TT",
"TT",
]
Re: substrings that consist of repeating characters
by johngg (Canon) on Sep 28, 2020 at 11:26 UTC
Just in case you need offsets as well, here's a solution for that.
use strict;
use warnings;
use feature qw{ say };
my $string
    = q{AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT};
my @matches;
push @matches, [ length $1, $1, $-[ 0 ] ]
while $string =~ m{(([ACGT])\2+)}g;
say qq{Found $_->[ 1 ], length $_->[ 0 ] at offset $_->[ 2 ]} for
sort { $b->[ 0 ] <=> $a->[ 0 ]
||
$a->[ 1 ] cmp $b->[ 1 ]
||
$a->[ 2 ] <=> $b->[ 2 ]
} @matches;
The output, sorted by descending length, then by ascending letter, then by ascending offset:
Found CCCCCC, length 6 at offset 42
Found GGGG, length 4 at offset 56
Found AAA, length 3 at offset 0
Found TTT, length 3 at offset 3
Found TTT, length 3 at offset 27
Found TTT, length 3 at offset 62
Found AA, length 2 at offset 13
Found AA, length 2 at offset 48
Found GG, length 2 at offset 15
Found GG, length 2 at offset 25
Found TT, length 2 at offset 8
Found TT, length 2 at offset 11
Found TT, length 2 at offset 39
Found TT, length 2 at offset 51
Found TT, length 2 at offset 54
I hope this is helpful.
Re: substrings that consist of repeating characters (updated x3)
by AnomalousMonk (Archbishop) on Sep 27, 2020 at 18:23 UTC
Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:19:34
C:\@Work\Perl\monks
>perl
use strict;
use warnings;
use Data::Dump qw(dd);
my $string = 'ACGTAAAAATGCCCATGGGGGGG';
my @repeats = do {
my $p;
grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
};
dd \@repeats;
__END__
["AAAAA", "CCC", "GGGGGGG"]
Update 1: But you also want lengths:
Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:20:42
C:\@Work\Perl\monks
>perl
use strict;
use warnings;
use Data::Dump qw(dd);
my $string = 'ACGTAAAAATGCCCATGGGGGGG';
my @repeats_and_lengths = do {
my $p;
map [ $_, length ],
grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
};
dd \@repeats_and_lengths;
__END__
[["AAAAA", 5], ["CCC", 3], ["GGGGGGG", 7]]
You already know how to sort this. :)
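For completeness, one possible way to sort that structure, longest run first. This is only a sketch; the data is copied from the output above and the alphabetical tie-break is my own assumption:

```perl
use strict;
use warnings;

# Same shape as the @repeats_and_lengths structure above: [ run, length ] pairs.
my @repeats_and_lengths = ( [ 'AAAAA', 5 ], [ 'CCC', 3 ], [ 'GGGGGGG', 7 ] );

# Longest first; ties (none in this data) are broken alphabetically by the run.
my @sorted = sort { $b->[1] <=> $a->[1] || $a->[0] cmp $b->[0] }
             @repeats_and_lengths;

print "$_->[0] ($_->[1])\n" for @sorted;
```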
Update 2:
... there are statements in the while loop that look doubtful ...
Other than the /g modifier on the /.../g regex, which I first called useless (oops... not useless!),
I don't see anything objectionable. There are usually several ways
to do anything, and which is "best" is often a question of taste,
unless you're Benchmark-ing.
... the idea of using an array to store the
substring along with its length might not be good.
Again, I see nothing to gripe about. It's a matter of taste and the
best impedance match to the rest of the code.
Update 3:
Oh, and one more thing... If you're doing a buncha matching
operations on a buncha long sequences, it might be useful to add a
validation step for each input sequence to be sure it consists only
of [ATCG] characters before any further matching operations
are done. This allows you to match with . (dot) and know that
you can only be matching a valid base character. This might
save significant time over many matches, but this can only be
determined for sure by benchmarking. (I'd be inclined to add a
validation step anyway just to be sure your data really is what you
think it is.)
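A validation step along those lines might look something like this. It is a minimal sketch; is_valid_dna is a hypothetical helper name and the second test sequence is made up to show a rejection:

```perl
use strict;
use warnings;

# is_valid_dna is a hypothetical helper. It returns 1 only when the sequence
# is non-empty and contains nothing but A, C, G or T; otherwise it returns 0.
sub is_valid_dna {
    my ($seq) = @_;
    return $seq =~ /\A[ACGT]+\z/ ? 1 : 0;
}

for my $seq ('ACGTAAAAATGCCCATGGGGGGG', 'ACGTNNACGT') {
    if (is_valid_dna($seq)) {
        my @runs;
        # After validation, dot can only ever match a valid base.
        push @runs, $1 while $seq =~ /((.)\2+)/g;
        print "$seq: @runs\n";
    }
    else {
        print "$seq: rejected, contains non-ACGT characters\n";
    }
}
```

Whether validating first and matching with dot is actually faster than matching with [ACGT] every time is something only a benchmark on real data can tell.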
Give a man a fish: <%-{-{-{-<
Re: substrings that consist of repeating characters
by LanX (Saint) on Sep 27, 2020 at 20:06 UTC
The simplest way to do it, demonstrated in the debugger
DB<39> $_ = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
DB<40> push @substr, $1 while /((\w)\2+)/g
DB<41> @sorted = sort { length($b) <=> length($a) } @substr
DB<42> x @sorted
0 'CCCCCC'
1 'GGGG'
2 'AAA'
3 'TTT'
4 'TTT'
5 'TTT'
6 'TT'
7 'TT'
8 'AA'
9 'GG'
10 'GG'
11 'TT'
12 'AA'
13 'TT'
14 'TT'
DB<43>
Storing the length in @substr for a Schwartzian transform might be faster, but I wouldn't bet on it.
IMHO length is only doing a simple lookup of the pre-calculated length stored inside Perl's data structure for strings, so it should be pretty fast.
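For comparison, the Schwartzian transform variant could be sketched as a standalone script like this (not a debugger session; the variable names follow the session above):

```perl
use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

my @substr;
push @substr, $1 while $string =~ /((\w)\2+)/g;

# Schwartzian transform: compute each length once, sort on the cached
# value, then unwrap the original strings again.
my @sorted =
    map  { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    map  { [ length($_), $_ ] } @substr;

print join("\n", @sorted), "\n";
```

Benchmark would be the way to find out whether caching the lengths pays off for your data.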
HTH! :)
update
you could also do sort and dump in one line:
DB<43> print join "\n", sort { length($b)<=>length($a) } @substr
CCCCCC
GGGG
AAA
TTT
TTT
TTT
TT
TT
AA
GG
GG
TT
AA
TT
TT
DB<44>
|
A slight simplification can be gained by using the 'nsort_by' function from List::UtilsBy (or its XS equivalent). You can also use the special variable '$,' rather than 'join' to control the print.
use strict;
use warnings;
use List::UtilsBy::XS qw(nsort_by);
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my @matches;
push @matches, $& while ($string=~m/([AGCT])\1+/g);
local $, = "\n";
print nsort_by {length} @matches ;
Re: substrings that consist of repeating characters
by salva (Canon) on Sep 28, 2020 at 20:50 UTC
A simpler variation of your code:
use strict;
use warnings;
my $string = "AAAAAAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTTTTTTTTTTTTTTTTTATTGGGGACTTT";
my $len = 0;
my $best = "";
while ($string =~ /((.)\2{$len,})/g) {
$len = length $1;
$best = $1
}
print "best: $best\n"
|
use strict;
use warnings;
my $string = "AAAATTTAGTTCTTAAGGCTGACATCACGTCAGCGTTACCCCCCAAGATTGGGGACTTT";
my $len = 0;
my $best = '';
$best = $1, $len = length $1 while $string =~ /((.)\2{$len,})/g;
print "best: $best ($len)\n"
Prints:
best: CCCCCC (6)
Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
|
my $len = 0;
my $best = "";
while ($string =~ /((.)(?:(*SKIP)\2){$len,})/g) {
$len = length $1;
$best = $1
}
print "best: $best\n"
But that is still not completely efficient: the regexp is recompiled at every loop iteration because of $len, so maybe the following simpler code could be faster:
my $best = "";
while ($string =~ /((.)\2+)/g) {
$best = $1 if length $1 > length $best
}
print "best: $best\n"
Or maybe this more convoluted variation:
my $best = "";
$best = $1 while $string =~ /((.)\2*)(*SKIP)(?(?{length $^N <= length $best})(*FAIL))/g;
print "best: $best\n"
|
Win8 Strawberry 5.30.3.1 (64) Tue 09/29/2020 13:32:10
C:\@Work\Perl\monks
>perl
use strict;
use warnings;
my $string = 'AABBBBCCC';
my $len = 0;
my $best = "";
while ($string =~ /((.)(?:(*SKIP)\2){$len,})/g) {
$len = length $1;
$best = $1
}
print "best: '$best' \n"
^Z
best: ''
Give a man a fish: <%-{-{-{-<
Re: substrings that consist of repeating characters
by Tux (Canon) on Sep 29, 2020 at 11:40 UTC
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );
use Test::More;
use Benchmark qw(cmpthese);
my %subs;
sub v1 {
%subs = ();
$subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
} # v1
sub v2 {
%subs = ();
$subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
} # v2
sub v3 {
%subs = ();
$subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
} # v3
sub v4 {
%subs = ();
$subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
} # v4
sub v5 {
%subs = ();
$subs{$&}++ while $string =~ m{([ACGT])\1+}g;
} # v5
v1 (); is_deeply (\%subs, \%expect, "v1");
v2 (); is_deeply (\%subs, \%expect, "v2");
v3 (); is_deeply (\%subs, \%expect, "v3");
v4 (); is_deeply (\%subs, \%expect, "v4");
v5 (); is_deeply (\%subs, \%expect, "v5");
printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs;
cmpthese (-2, { v1 => \&v1, v2 => \&v2, v3 => \&v3, v4 => \&v4, v5 => \&v5 });
done_testing;
=>
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
1 6 CCCCCC
1 4 GGGG
1 3 AAA
3 3 TTT
2 2 AA
2 2 GG
5 2 TT
Rate v2 v1 v3 v4 v5
v2 41981/s -- -30% -52% -56% -57%
v1 59864/s 43% -- -31% -38% -39%
v3 87244/s 108% 46% -- -9% -12%
v4 95919/s 128% 60% 10% -- -3%
v5 98685/s 135% 65% 13% 3% --
1..5
Enjoy, Have FUN! H.Merijn
|
IIRC, once perl sees $& anywhere in the program code, it starts to populate that variable (and $' and $`) for all the regular expression matches in the program. Using it impacts the performance of all the regular expressions in the code, not just those ones where it is actually needed!
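On perls where that penalty applies, one way around it is the /p modifier with ${^MATCH}, available since Perl 5.10. A minimal sketch, reusing the %subs counting idea from the benchmark above:

```perl
use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

my %subs;
# /p asks perl to preserve the matched text for this regex only;
# ${^MATCH} then plays the role of $& without $& appearing in the program.
$subs{ ${^MATCH} }++ while $string =~ m{([ACGT])\1+}gp;

printf "%-6s %d\n", $_, $subs{$_} for sort keys %subs;
```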
|
perlvar does mention the issue, but it also says this has been fully fixed since v5.20.
Edit: so on older perls you might still get the same relative ranking for the different versions: although $& on its own would be significantly slower than the other solutions, using it anywhere in the benchmark would also slow down all the other versions.
|
Edit: I thought I had a better version, but no. I ran the same benchmark again and the results were not the same at all (actually the three solutions had very similar performance). Something went wrong with my first attempt.
I'm actually consistently getting results that are worse without backreferences, which I don't understand...
|
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );
my $n = shift // 1;
if ($n > 1) {
$string = $string x $n;
$_ *= $n for values %expect;
}
use Test::More;
use Benchmark qw(cmpthese);
my %subs;
my @v = map { "v$_" } 1 .. 8;
my %f; @f{@v} = (
sub { %subs = ();
$subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
}, # v1
sub { %subs = ();
$subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
}, # v2
sub { %subs = ();
$subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
}, # v3
sub { %subs = ();
$subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
}, # v4
sub { %subs = ();
$subs{$&}++ while $string =~ m{([ACGT])\1+}g;
}, # v5
sub { %subs = ();
$subs{$&}++ while $string =~ m{A{2,}|C{2,}|G{2,}|T{2,}}g;
}, # v6
sub { %subs = ();
$subs{$&}++ while $string =~ m{AA+|CC+|GG+|TT+}g;
}, # v7
sub { %subs = ();
$subs{$&}++ while $string =~ m{()AA+|CC+|GG+|TT+}g;
}, # v8
);
for (@v) {
$f{$_}->(); is_deeply (\%subs, \%expect, $_);
}
printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs;
cmpthese (-2, { map {( $_ => $f{$_} )} @v });
done_testing;
$ test.pl 1
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
1 6 CCCCCC
1 4 GGGG
1 3 AAA
3 3 TTT
2 2 AA
2 2 GG
5 2 TT
Rate v2 v1 v7 v3 v4 v5 v6 v8
v2 41819/s -- -30% -45% -53% -57% -58% -60% -63%
v1 60150/s 44% -- -21% -32% -38% -40% -43% -47%
v7 76560/s 83% 27% -- -13% -22% -23% -28% -32%
v3 88071/s 111% 46% 15% -- -10% -12% -17% -22%
v4 97745/s 134% 63% 28% 11% -- -2% -8% -13%
v5 99555/s 138% 66% 30% 13% 2% -- -6% -12%
v6 105700/s 153% 76% 38% 20% 8% 6% -- -6%
v8 112783/s 170% 88% 47% 28% 15% 13% 7% --
1..8
$ test.pl 20
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
20 6 CCCCCC
20 4 GGGG
20 3 AAA
60 3 TTT
40 2 AA
40 2 GG
100 2 TT
Rate v2 v1 v7 v3 v4 v5 v6 v8
v2 2327/s -- -29% -47% -52% -55% -57% -61% -65%
v1 3284/s 41% -- -26% -32% -37% -39% -45% -50%
v7 4419/s 90% 35% -- -9% -15% -17% -26% -33%
v3 4853/s 109% 48% 10% -- -7% -9% -18% -27%
v4 5215/s 124% 59% 18% 7% -- -3% -12% -21%
v5 5351/s 130% 63% 21% 10% 3% -- -10% -19%
v6 5934/s 155% 81% 34% 22% 14% 11% -- -10%
v8 6604/s 184% 101% 49% 36% 27% 23% 11% --
1..8
$ test.pl 2000
ok 1 - v1
ok 2 - v2
ok 3 - v3
ok 4 - v4
ok 5 - v5
ok 6 - v6
ok 7 - v7
ok 8 - v8
2000 6 CCCCCC
2000 4 GGGG
2000 3 AAA
6000 3 TTT
4000 2 AA
4000 2 GG
10000 2 TT
Rate v2 v1 v7 v3 v4 v5 v6 v8
v2 21.3/s -- -35% -50% -54% -60% -61% -64% -68%
v1 32.7/s 54% -- -23% -30% -38% -39% -45% -51%
v7 42.6/s 100% 30% -- -9% -19% -21% -28% -36%
v3 46.6/s 119% 42% 9% -- -12% -14% -21% -30%
v4 52.7/s 147% 61% 24% 13% -- -2% -11% -21%
v5 54.0/s 154% 65% 27% 16% 3% -- -9% -19%
v6 59.2/s 178% 81% 39% 27% 13% 10% -- -11%
v8 66.3/s 212% 103% 56% 42% 26% 23% 12% --
1..8
Enjoy, Have FUN! H.Merijn
|
You might want to run the different variants through use re "debug" to see how they are translated into regex primitives. This might give you a clue about what is happening.
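For example, something like this dumps the compiled program and the match trace for one of the variants to STDERR. It is a sketch with a shortened string to keep the trace readable:

```perl
use strict;
use warnings;

my $string = 'AAATTTAGTTCTT';
my @runs;

{
    # The pragma is lexically scoped, so only regexes compiled inside
    # this block produce debug output.
    use re 'debug';
    @runs = $string =~ m{AA+|CC+|GG+|TT+}g;
}

print "@runs\n";
```

Swapping in the other patterns one at a time lets you compare the opcodes perl generates for each.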
|
Note also that the benchmark results may be different for other input strings. The one in the OP is short and all the same-char substrings are also short, so for instance, results may be different if you use a long string containing long same-char substrings.
Re: substrings that consist of repeating characters
by Tux (Canon) on Sep 29, 2020 at 11:09 UTC
Just to add to the confusion, TIMTOWTDI
use 5.18.0;
use warnings;
my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %subs;
$subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] } map {[ length, $_ ]} keys %subs;
1 6 CCCCCC
1 4 GGGG
1 3 AAA
3 3 TTT
2 2 AA
2 2 GG
5 2 TT
Enjoy, Have FUN! H.Merijn
Re: substrings that consist of repeating characters
by perlfan (Vicar) on Sep 28, 2020 at 01:37 UTC
Given your motivating example, you may have already seen BioPerl. If not, its features and innards might be worth studying.
Re: substrings that consist of repeating characters
by vr (Curate) on Sep 29, 2020 at 17:24 UTC
The task at hand shouts "RLE!!!" at me. General purpose RLE, efficiently (let's hope so) implemented (i.e. coded in C), accessed from Perl -- why, PDL, of course.
The benchmark below is probably very skewed because my test DNA consists of only short same-base (nucleotide) fragments. Let's assume the ultimate goal is the length of the longest run of C's and its position. The only other contestant is salva's code, modified to fit the stated purpose. Sorry if I missed a faster solution from another monk.
Note: sneaking Perl's scalar in as PDL raw data looks hackish, which it is. Opening the scalar as a filehandle and then using readflex to stuff PDL raw data is, alas, too slow.
use strict;
use warnings;
use Time::HiRes 'time';
use Readonly;
Readonly my $SIZE => 10_000_000;
my $str;
{ # get us some data
use String::Random 'random_regex';
srand 1234;
$str = random_regex( "[ACTG]{$SIZE}" );
}
{
print "\nlet's test PDL!\n";
use PDL;
my $t = time;
my $p = PDL-> new_from_specification( byte, $SIZE );
${ $p-> get_dataref } = $str;
$p-> upd_data;
my ( $lengths, $values ) = $p-> rle;
my $cumu = $lengths-> cumusumover;
my $C_lengths = $lengths * ( $values == ord 'C' );
my ( undef, $max, undef, $max_ind ) = $C_lengths-> minmaximum;
report( $max, $cumu-> at( $max_ind - 1 ), time - $t )
}
{
print "\nlet's test pure Perl's re-engine!\n";
my $t = time;
my $best = [ -1, -1 ];
while ( $str =~ /((C)\2+)/g ) {
$best = [ length( $1 ), $-[ 1 ]]
if length $1 > $best-> [ 0 ]
}
report( @$best, time - $t )
}
sub report {
printf "\tmax run of C's is %d bases long at %d\n\ttime consumed: %f\n", @_
}
__END__
let's test PDL!
max run of C's is 11 bases long at 4367281
time consumed: 0.164513
let's test pure Perl's re-engine!
max run of C's is 11 bases long at 4367281
time consumed: 0.361907
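For readers without PDL, the run-length encoding idea itself can be sketched in plain Perl. This is not the PDL approach benchmarked above and makes no speed claim; it uses the short string from the OP, and the [ base, length, offset ] layout is my own choice:

```perl
use strict;
use warnings;

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';

# Run-length encode: one [ base, length, offset ] triple per run,
# including runs of length 1.
my @rle;
push @rle, [ $2, length($1), $-[0] ] while $string =~ /((.)\2*)/g;

# Pick the longest run of C's from the encoded form.
my ($best) = sort { $b->[1] <=> $a->[1] } grep { $_->[0] eq 'C' } @rle;

printf "max run of C's is %d bases long at %d\n", $best->[1], $best->[2];
```

Once the string is encoded, questions about any base reduce to filtering and sorting the triples.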
Re: substrings that consist of repeating characters
by wazat (Monk) on Sep 30, 2020 at 23:47 UTC
While I wouldn't recommend it, I didn't see the following approach:
my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
my %max;
$string =~ s[(.)\1*][if ( length($&) > ($max{$1} // 0) ) {$max{$1} = length($&); }]eg;
for my $k (sort keys %max) {
print"$max{$k} ", $k x $max{$k}, "\n";
}
output
3 AAA
6 CCCCCC
4 GGGG
3 TTT
Re: substrings that consist of repeating characters
by Anonymous Monk on Sep 28, 2020 at 05:55 UTC