in reply to program to look for specific K-mer sequence

In addition to what Eily wrote, please bear in mind that this is PerlMonks and not BioMonks: what's a k-mer?

A consultation of Wikipedia turned up a definition. IIUC, the 25-base sequence GGGGGGGGGGGGGGGGGGGGGGGGG has three overlapping 23-base k-mers ending in GG: the 23-base substrings starting at offsets 0, 1 and 2.
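To make the counting concrete, here is a minimal sketch that slides a 23-base window along the sequence with substr and keeps the windows ending in GG. The helper name windows_ending_in_gg is made up for illustration, and the sequence is built with the x operator rather than typed out.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): every length-$k window
# of $seq that ends in 'GG', returned as [offset, substring] pairs.
sub windows_ending_in_gg {
    my ($seq, $k) = @_;
    my @hits;
    for my $offset (0 .. length($seq) - $k) {
        my $window = substr $seq, $offset, $k;
        push @hits, [$offset, $window] if $window =~ m{ GG \z }xms;
    }
    return @hits;
}

my $seq = 'G' x 25;    # the 25-base example sequence
printf "offset %d: %s\n", @$_ for windows_ending_in_gg($seq, 23);
```

A sequence of length n has n - k + 1 windows of length k, which is where the three offsets for n = 25 and k = 23 come from.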

The canonical regex approach to extracting overlapping patterns is this:

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"my $seq = 'CCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 my @k_mers = $seq =~ m{ (?= (.{21} GG)) }xmsg;
 print qq{'$_'} for @k_mers;
"
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'
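
Why the lookahead matters: a sketch, using the same example sequence (built with the x operator here so the run lengths are explicit), that compares a plain consuming match against the zero-width lookahead version.

```perl
use strict;
use warnings;

# Same bases as the example sequence: CCC, 21 A's, GGTTGGCCGG, AAA.
my $seq = 'CCC' . ('A' x 21) . 'GGTTGGCCGG' . 'AAA';

# Consuming match: each hit swallows its 23 bases, so the next search
# resumes after them and any overlapping k-mers are skipped.
my @consumed = $seq =~ m{ (.{21} GG) }xmsg;

# Zero-width lookahead: the overall match has length zero, so the regex
# engine moves on by a single position and every overlapping k-mer is captured.
my @overlapping = $seq =~ m{ (?= (.{21} GG)) }xmsg;

print scalar(@consumed), " consuming, ", scalar(@overlapping), " overlapping\n";
```

The capture group inside (?= ... ) still fills $1 (or the returned list), even though the lookahead itself consumes nothing.
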
You want only unique k-mers and only the first thousand. One approach (assuming you have already extracted the entire, contiguous base sequence from each FASTA record):

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 ;;
 my $rx_kmer = qr{ .{21} GG }xms;
 ;;
 my @k_mers;
 my %seen;
 ;;
 KMER:
 while ($seq =~ m{ (?= ($rx_kmer)) }xmsg) {
     push @k_mers, $1 unless $seen{$1}++;
     last KMER if @k_mers >= 1000;
     }
 ;;
 print qq{'$_'} for @k_mers;
"
'GGGGGGGGGGGGGGGGGGGGGGG'
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'
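
The approach above assumes each record's sequence is already one contiguous string. In case that step is the sticking point, here is a rough sketch of one way to get there. The read_fasta helper and the two-record input are made up for illustration, and it assumes a plain FASTA layout: a '>' header line followed by one or more sequence lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): read FASTA text from a
# filehandle and return [header, sequence] pairs, with each record's
# sequence lines joined into one string.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s{ \s+ \z }{}xms;        # strip newline and trailing whitespace
        next if $line eq '';              # skip blank lines
        if ($line =~ m{ \A > (.*) }xms) {
            push @records, [$1, ''];      # a '>' header starts a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;     # append to the current record
        }
    }
    return @records;
}

# Made-up input, read from an in-memory filehandle.
my $fasta = ">rec1\nCCCAAAA\nAAGG\n>rec2\nGGTT\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
printf "%s => %s\n", @$_ for read_fasta($fh);
```

Each joined sequence can then be fed to the k-mer loop in place of $seq.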

Update 1: Eliminated entirely unnecessary  $count variable from final code example.

Update 2: Here's a slightly more elegant approach if you're not going to be extracting millions of k-mers (see List::MoreUtils::uniq):

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"use List::MoreUtils qw(uniq);
 ;;
 my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 ;;
 my $rx_kmer = qr{ .{21} GG }xms;
 ;;
 my @k_mers = uniq $seq =~ m{ (?= ($rx_kmer)) }xmsg;
 $#k_mers = 999 if $#k_mers >= 1000;
 ;;
 print qq{'$_'} for @k_mers;
"
'GGGGGGGGGGGGGGGGGGGGGGG'
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'

Update 3: If you also need the offset of each extracted k-mer subsequence, there are ways to do that, too.
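
One possible way, sketched here rather than offered as the definitive answer: inside the while loop, $-[1] holds the start offset of capture group 1 (see @- in perlvar), so it can be stored alongside $1. The sequence is the earlier example, built with the x operator.

```perl
use strict;
use warnings;

# Same bases as the first example sequence: CCC, 21 A's, GGTTGGCCGG, AAA.
my $seq = 'CCC' . ('A' x 21) . 'GGTTGGCCGG' . 'AAA';

my @hits;    # [offset, k-mer] pairs
while ($seq =~ m{ (?= (.{21} GG)) }xmsg) {
    # $-[1] is the offset in $seq at which capture group 1 begins.
    push @hits, [$-[1], $1];
}

printf "%2d %s\n", @$_ for @hits;
```

Because the lookahead is zero-width, pos($seq) inside the loop gives the same offset as $-[1] here.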


Give a man a fish:  <%-{-{-{-<

Re^2: program to look for specific K-mer sequence
by LanX (Saint) on Apr 20, 2017 at 17:56 UTC