in reply to Find number of short words in long word

'AGTGTAT' would generate >16000 columns for every combination),

Much much less if you don't list all the *possible* combinations.

To find the subsequences:

sub scan_seq { local our %sub_seqs; $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; }

All together:

use strict; use warnings; sub scan_seq { my $len = shift; local our %sub_seqs; use re 'eval'; $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; } my @seqs = qw( ATGCTGTACTG ATGATGATCTG ATGCGTATGCA ATGCTAGACTG ); my $sub_seq_len = 7; my $max_len = length('Total'); my %cols; my %grid; for my $seq (@seqs) { $max_len = length($seq) if length($seq) > $max_len; my $row = scan_seq($sub_seq_len, $seq); $grid{$seq} = $row; $cols{$_} += $row->{$_} for keys %$row; } my @cols = sort keys %cols; my $col_sep = ' '; my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . +"\n"; my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) . +"\n"; printf($fmt_h, '', @cols); for my $seq (@seqs) { my $row = $grid{$seq}; printf($fmt_b, $seq, map $_||0, @$row{@cols}); } printf($fmt_b, 'Total', map $_||0, @cols{@cols});
ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA +GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG +TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG ATGCTGTACTG 0 0 0 0 1 0 + 0 1 0 0 0 1 0 0 + 0 0 0 0 1 1 ATGATGATCTG 1 1 0 0 0 0 + 0 0 1 0 0 0 0 0 + 1 1 0 0 0 0 ATGCGTATGCA 0 0 1 0 0 1 + 0 0 0 1 0 0 1 0 + 0 0 1 0 0 0 ATGCTAGACTG 0 0 0 1 0 0 + 1 0 0 0 1 0 0 1 + 0 0 0 1 0 0 Total 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1

If you really want to print all 16,384 columns, then replace

my @cols = sort keys %cols;
with
my @cols = glob('{A,C,G,T}' x $sub_seq_len);

It'll still use the minimum memory required.

Update: Added complete code.
Update: Made subsequence length variable instead of hard-coded.
Update: Added change needed to display all columns.

Replies are listed 'Best First'.
Re^2: Find number of short words in long word
by repellent (Priest) on Jul 15, 2009 at 09:07 UTC
    The regex version is nice. This is more explicit:
    sub scan_seq { my ($len, $seq) = @_; my %sub_seqs; if ($len > 0) { ++$sub_seqs{ substr($seq, $_, $len) } for 0 .. length($seq) - $len; } return \%sub_seqs; }

    Update: Changed local our to my.