'AGTGTAT' would generate >16000 columns for every combination),

Much much less if you don't list all the *possible* combinations.

To find the subsequences:

sub scan_seq { local our %sub_seqs; $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; }

All together:

use strict; use warnings; sub scan_seq { my $len = shift; local our %sub_seqs; use re 'eval'; $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; } my @seqs = qw( ATGCTGTACTG ATGATGATCTG ATGCGTATGCA ATGCTAGACTG ); my $sub_seq_len = 7; my $max_len = length('Total'); my %cols; my %grid; for my $seq (@seqs) { $max_len = length($seq) if length($seq) > $max_len; my $row = scan_seq($sub_seq_len, $seq); $grid{$seq} = $row; $cols{$_} += $row->{$_} for keys %$row; } my @cols = sort keys %cols; my $col_sep = ' '; my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . +"\n"; my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) . +"\n"; printf($fmt_h, '', @cols); for my $seq (@seqs) { my $row = $grid{$seq}; printf($fmt_b, $seq, map $_||0, @$row{@cols}); } printf($fmt_b, 'Total', map $_||0, @cols{@cols});
ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA +GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG +TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG ATGCTGTACTG 0 0 0 0 1 0 + 0 1 0 0 0 1 0 0 + 0 0 0 0 1 1 ATGATGATCTG 1 1 0 0 0 0 + 0 0 1 0 0 0 0 0 + 1 1 0 0 0 0 ATGCGTATGCA 0 0 1 0 0 1 + 0 0 0 1 0 0 1 0 + 0 0 1 0 0 0 ATGCTAGACTG 0 0 0 1 0 0 + 1 0 0 0 1 0 0 1 + 0 0 0 1 0 0 Total 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1

If you really want to print all 16,384 columns, then replace

my @cols = sort keys %cols;
with
my @cols = glob('{A,C,G,T}' x $sub_seq_len);

It'll still use the minimum memory required.

Update: Added complete code.
Update: Made subsequence length variable instead of hard-coded.
Update: Added change needed to display all columns.


In reply to Re: Find number of short words in long word by ikegami
in thread Find number of short words in long word by sedm1000

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.