Re: Find number of short words in long word

'AGTGTAT' would generate >16000 columns for every combination),

Much much less if you don't list all the *possible* combinations.

To find the subsequences:

sub scan_seq {
    local our %sub_seqs;
    $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/;
    return \%sub_seqs;
}
[download]

All together:

use strict;
use warnings;

sub scan_seq {
    my $len = shift;
    local our %sub_seqs;
    use re 'eval';
    $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/;
    return \%sub_seqs;
}

my @seqs = qw(
    ATGCTGTACTG
    ATGATGATCTG
    ATGCGTATGCA
    ATGCTAGACTG
);

my $sub_seq_len = 7;

my $max_len = length('Total');
my %cols;
my %grid;
for my $seq (@seqs) {
    $max_len = length($seq) if length($seq) > $max_len;
    my $row = scan_seq($sub_seq_len, $seq);
    $grid{$seq} = $row;
    $cols{$_} += $row->{$_} for keys %$row;
}

my @cols = sort keys %cols;

my $col_sep = '  ';
my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . 
+"\n";
my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d"  x @cols) . 
+"\n";

printf($fmt_h, '', @cols);
for my $seq (@seqs) {
    my $row = $grid{$seq};
    printf($fmt_b, $seq, map $_||0, @$row{@cols});
}
printf($fmt_b, 'Total', map $_||0, @cols{@cols});
[download]

             ATGATCT  ATGATGA  ATGCGTA  ATGCTAG  ATGCTGT  CGTATGC  CTA
+GACT  CTGTACT  GATGATC  GCGTATG  GCTAGAC  GCTGTAC  GTATGCA  TAGACTG  
+TGATCTG  TGATGAT  TGCGTAT  TGCTAGA  TGCTGTA  TGTACTG
ATGCTGTACTG        0        0        0        0        1        0     
+   0        1        0        0        0        1        0        0  
+      0        0        0        0        1        1
ATGATGATCTG        1        1        0        0        0        0     
+   0        0        1        0        0        0        0        0  
+      1        1        0        0        0        0
ATGCGTATGCA        0        0        1        0        0        1     
+   0        0        0        1        0        0        1        0  
+      0        0        1        0        0        0
ATGCTAGACTG        0        0        0        1        0        0     
+   1        0        0        0        1        0        0        1  
+      0        0        0        1        0        0
Total              1        1        1        1        1        1     
+   1        1        1        1        1        1        1        1  
+      1        1        1        1        1        1
[download]

If you really want to print all 16,384 columns, then replace

my @cols = sort keys %cols;
[download]

with

my @cols = glob('{A,C,G,T}' x $sub_seq_len);
[download]

It'll still use the minimum memory required.

Update: Added complete code.
Update: Made subsequence length variable instead of hard-coded.
Update: Added change needed to display all columns.

Comment on Re: Find number of short words in long word Select or Download Code

Replies are listed 'Best First'.
Re^2: Find number of short words in long word by repellent (Priest) on Jul 15, 2009 at 09:07 UTC
The regex version is nice. This is more explicit: `sub scan_seq { my ($len, $seq) = @_; my %sub_seqs; if ($len > 0) { ++$sub_seqs{ substr($seq, $_, $len) } for 0 .. length($seq) - $len; } return \%sub_seqs; }` [download] Update: Changed `local our` to `my`.	[reply] [d/l] [select]