'AGTGTAT' would generate >16000 columns for every combination),
Much much less if you don't list all the *possible* combinations.
To find the subsequences:
sub scan_seq {
local our %sub_seqs;
$_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/;
return \%sub_seqs;
}
All together:
use strict;
use warnings;
sub scan_seq {
my $len = shift;
local our %sub_seqs;
use re 'eval';
$_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/;
return \%sub_seqs;
}
my @seqs = qw(
ATGCTGTACTG
ATGATGATCTG
ATGCGTATGCA
ATGCTAGACTG
);
my $sub_seq_len = 7;
my $max_len = length('Total');
my %cols;
my %grid;
for my $seq (@seqs) {
$max_len = length($seq) if length($seq) > $max_len;
my $row = scan_seq($sub_seq_len, $seq);
$grid{$seq} = $row;
$cols{$_} += $row->{$_} for keys %$row;
}
my @cols = sort keys %cols;
my $col_sep = ' ';
my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) .
+"\n";
my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) .
+"\n";
printf($fmt_h, '', @cols);
for my $seq (@seqs) {
my $row = $grid{$seq};
printf($fmt_b, $seq, map $_||0, @$row{@cols});
}
printf($fmt_b, 'Total', map $_||0, @cols{@cols});
ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA
+GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG
+TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG
ATGCTGTACTG 0 0 0 0 1 0
+ 0 1 0 0 0 1 0 0
+ 0 0 0 0 1 1
ATGATGATCTG 1 1 0 0 0 0
+ 0 0 1 0 0 0 0 0
+ 1 1 0 0 0 0
ATGCGTATGCA 0 0 1 0 0 1
+ 0 0 0 1 0 0 1 0
+ 0 0 1 0 0 0
ATGCTAGACTG 0 0 0 1 0 0
+ 1 0 0 0 1 0 0 1
+ 0 0 0 1 0 0
Total 1 1 1 1 1 1
+ 1 1 1 1 1 1 1 1
+ 1 1 1 1 1 1
If you really want to print all 16,384 columns, then replace
my @cols = sort keys %cols;
with
my @cols = glob('{A,C,G,T}' x $sub_seq_len);
It'll still use the minimum memory required.
Update: Added complete code.
Update: Made subsequence length variable instead of hard-coded.
Update: Added change needed to display all columns.
|