sedm1000 has asked for the wisdom of the Perl Monks concerning the following question:

As a humble seeker of assistance... I am looking to generate a script that returns the frequency of a short word in a longer original (a DNA sequence).... I have a one-liner for generating all possible short words of a certain length (k), e.g.;
perl -se '$,="\t"; print split / /, qx/echo @{["{$a}" x $k]}/;' -- -k= +4 -a='A,T,G,C'
This gives 256 tab delimited short DNA sequences. I would like to use these as column names, and score the frequency of each short DNA sequence in a set of longer sequences, e.g.
ACTG ATTG ATGC etc ATGCTGTACTG 1 0 1 ATGATGATCTG 0 0 0 ATGCGTATGCA 0 0 1 ATGCTAGACTG 1 0 1
The files are >20000 lines, and it would be nice to scale to larger short word sizes (i.e. 7 letter words such as'AGTGTAT' would generate >16000 columns for every combination), so efficiency is fairly important. Unfortunately I haven't been able to find a ready made example to do precisely what is required. Any guidance to this would be much appreciated. Thanks.

Replies are listed 'Best First'.
Re: Find number of short words in long word
by ikegami (Patriarch) on Jul 14, 2009 at 20:19 UTC

    'AGTGTAT' would generate >16000 columns for every combination),

    Much much less if you don't list all the *possible* combinations.

    To find the subsequences:

    sub scan_seq { local our %sub_seqs; $_[0] =~ /(.{7})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; }

    All together:

    use strict; use warnings; sub scan_seq { my $len = shift; local our %sub_seqs; use re 'eval'; $_[0] =~ /(.{$len})(?{ ++$sub_seqs{$1} })(?!)/; return \%sub_seqs; } my @seqs = qw( ATGCTGTACTG ATGATGATCTG ATGCGTATGCA ATGCTAGACTG ); my $sub_seq_len = 7; my $max_len = length('Total'); my %cols; my %grid; for my $seq (@seqs) { $max_len = length($seq) if length($seq) > $max_len; my $row = scan_seq($sub_seq_len, $seq); $grid{$seq} = $row; $cols{$_} += $row->{$_} for keys %$row; } my @cols = sort keys %cols; my $col_sep = ' '; my $fmt_h = "%-${max_len}s" . ("$col_sep%-${sub_seq_len}s" x @cols) . +"\n"; my $fmt_b = "%-${max_len}s" . ("$col_sep%${sub_seq_len}d" x @cols) . +"\n"; printf($fmt_h, '', @cols); for my $seq (@seqs) { my $row = $grid{$seq}; printf($fmt_b, $seq, map $_||0, @$row{@cols}); } printf($fmt_b, 'Total', map $_||0, @cols{@cols});
    ATGATCT ATGATGA ATGCGTA ATGCTAG ATGCTGT CGTATGC CTA +GACT CTGTACT GATGATC GCGTATG GCTAGAC GCTGTAC GTATGCA TAGACTG +TGATCTG TGATGAT TGCGTAT TGCTAGA TGCTGTA TGTACTG ATGCTGTACTG 0 0 0 0 1 0 + 0 1 0 0 0 1 0 0 + 0 0 0 0 1 1 ATGATGATCTG 1 1 0 0 0 0 + 0 0 1 0 0 0 0 0 + 1 1 0 0 0 0 ATGCGTATGCA 0 0 1 0 0 1 + 0 0 0 1 0 0 1 0 + 0 0 1 0 0 0 ATGCTAGACTG 0 0 0 1 0 0 + 1 0 0 0 1 0 0 1 + 0 0 0 1 0 0 Total 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1

    If you really want to print all 16,384 columns, then replace

    my @cols = sort keys %cols;
    with
    my @cols = glob('{A,C,G,T}' x $sub_seq_len);

    It'll still use the minimum memory required.

    Update: Added complete code.
    Update: Made subsequence length variable instead of hard-coded.
    Update: Added change needed to display all columns.

      The regex version is nice. This is more explicit:
      sub scan_seq { my ($len, $seq) = @_; my %sub_seqs; if ($len > 0) { ++$sub_seqs{ substr($seq, $_, $len) } for 0 .. length($seq) - $len; } return \%sub_seqs; }

      Update: Changed local our to my.
Re: Find number of short words in long word
by ikegami (Patriarch) on Jul 14, 2009 at 21:09 UTC
      Cool. Thanks, that does the trick. I'm always amazed by how quickly people here can come up with an answer I have spent days working on...
Re: Find number of short words in long word
by SuicideJunkie (Vicar) on Jul 14, 2009 at 20:20 UTC

    The =()= 'operator' should do:

    my $count =()= 'abracadabra' =~ /br/g; print "Found $count instances\n" # prints Found 2 instances
    Big Update:
    Based on all the comments below, this might be handy for automatically generating the lookaheads that allow overlapping results based on a search string:
    sub countOverlappingMatches { my $stringRef = shift; #Likely to be huge, don't make a copy my $patternRef = shift; my $count =()= $$stringRef =~ /(?=$$patternRef)/g; return $count; } print countOverlappingMatches(\'abrabrabrabra', \'rabra'); # 3
    Or extract the one active line that is left in there.
    for my $pattern (@listOfPatterns) { my $count =()= $string =~ /(?=$pattern)/g; # Play with $count }
      my $count =()= 'AAAAA' =~ /AAAA/g; print "Found $count instances\n"

      It prints 1, but I believe the OP wants 2. More specifically, from ATGCTGTACTG, I believe the OP wants

      ACTG: 1 ATGC: 1 CTGT: 1 GCTG: 1 GTAC: 1 TACT: 1 TGCT: 1 TGTA: 1

      I don't know if it is important for the OP, but your code doesn't find overlapping instances:

      my $count =()= 'abrabrabrabra' =~ /brabra/g; print "Found $count instances\n" # prints Found 2 instances (not 3)

      Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

        ... and if it's something you need to fix, you can do so easily by packing everything but the first character into a look-ahead:
        my $count =()= 'abrabrabrabra' =~ /b(?=rabra)/g;

        Of course in this case you know that the first possible overlap starts at the second be, so even /bra(?=bra)/ works fine.

        (I haven't benchmarked it, but I suppose that non-look-around literals are a bit faster, due to optimizations regarding the match length).

        Thanks greatly for the start on this... All *possible* combinations are important - certainly over 20K sequences they'll likely all appear at least a couple of times. Overlapping instances are important - so "AGCTGT" would need to be scored;
        AGCT GCTG CTGT TGTA etc. AGCTGT 1 1 1 0
        and so on...