'AGTGTAT' would generate >16000 columns for every combination),
Much much less if you don't list all the *possible* combinations.
To find the subsequences:
sub scan_seq { local our %sub_seqs; $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; }
All together:
use strict; use warnings; sub scan_seq { my $len = shift; local our %sub_seqs; use re 'eval'; $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; } my @seqs = qw( ATGCTGTACTG ATGATGATCTG ATGCGTATGCA ATGCTAGACTG ); my $sub_seq_len = 7; my $max_len = length('Total'); my %cols; my %grid; for my $seq (@seqs) { $max_len = length($seq) if length($seq) > $max_len; my $row = scan_seq($sub_seq_len, $seq); $grid{$seq} = $row; $cols{$_} += $row->{$_} for keys %$row; } my @cols = sort keys %cols; my $col_sep = ' '; my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . +"\n"; my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) . +"\n"; printf($fmt_h, '', @cols); for my $seq (@seqs) { my $row = $grid{$seq}; printf($fmt_b, $seq, map $_||0, @$row{@cols}); } printf($fmt_b, 'Total', map $_||0, @cols{@cols});
ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA +GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG +TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG ATGCTGTACTG 0 0 0 0 1 0 + 0 1 0 0 0 1 0 0 + 0 0 0 0 1 1 ATGATGATCTG 1 1 0 0 0 0 + 0 0 1 0 0 0 0 0 + 1 1 0 0 0 0 ATGCGTATGCA 0 0 1 0 0 1 + 0 0 0 1 0 0 1 0 + 0 0 1 0 0 0 ATGCTAGACTG 0 0 0 1 0 0 + 1 0 0 0 1 0 0 1 + 0 0 0 1 0 0 Total 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1
If you really want to print all 16,384 columns, then replace
withmy @cols = sort keys %cols;
my @cols = glob('{A,C,G,T}' x $sub_seq_len);
It'll still use the minimum memory required.
Update: Added complete code.
Update: Made subsequence length variable instead of hard-coded.
Update: Added change needed to display all columns.
In reply to Re: Find number of short words in long word
by ikegami
in thread Find number of short words in long word
by sedm1000
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |