In addition to what Eily wrote, please bear in mind that this is PerlMonks and not BioMonks: what's a k-mer?
A consultation of Wikipedia turned up this. IIUC, the 25-base sequence GGGGGGGGGGGGGGGGGGGGGGGGG has three overlapping 23-base k-mers ending in GG: the 23-base substrings starting at offsets 0, 1 and 2.
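To make that reading concrete, here is a minimal sketch that walks every 23-base window of the sequence and keeps those ending in GG. The helper name kmers_ending_in_gg is hypothetical, invented for this illustration, and the plain substr loop is only a cross-check on my understanding, not the approach recommended further down.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return [offset, k-mer] pairs
# for every window of length $k in $seq that ends in 'GG'.
sub kmers_ending_in_gg {
    my ($seq, $k) = @_;
    my @hits;
    # An empty range (sequence shorter than $k) simply yields no windows.
    for my $offset (0 .. length($seq) - $k) {
        my $kmer = substr $seq, $offset, $k;
        push @hits, [ $offset, $kmer ] if $kmer =~ /GG\z/;
    }
    return @hits;
}

# The 25-base all-G sequence discussed above.
my $seq = 'G' x 25;
printf "offset %d: %s\n", @$_ for kmers_ending_in_gg($seq, 23);
```

For the all-G sequence this should report windows at offsets 0, 1 and 2, which is where the count of three comes from.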
The canonical regex approach to extracting overlapping patterns is this:

    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
    "my $seq = 'CCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
     my @k_mers = $seq =~ m{ (?= (.{21} GG)) }xmsg;
     print qq{'$_'} for @k_mers;
    "
    'AAAAAAAAAAAAAAAAAAAAAGG'
    'AAAAAAAAAAAAAAAAAGGTTGG'
    'AAAAAAAAAAAAAGGTTGGCCGG'

You want only unique k-mers and only the first thousand. One approach (assuming you have already extracted the entire, contiguous base sequence from each FASTA record):
    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
    "my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
     ;;
     my $rx_kmer = qr{ .{21} GG }xms;
     ;;
     my @k_mers;
     my %seen;
     ;;
     KMER:
     while ($seq =~ m{ (?= ($rx_kmer)) }xmsg) {
         push @k_mers, $1 unless $seen{$1}++;
         last KMER if @k_mers >= 1000;
         }
     ;;
     print qq{'$_'} for @k_mers;
    "
    'GGGGGGGGGGGGGGGGGGGGGGG'
    'AAAAAAAAAAAAAAAAAAAAAGG'
    'AAAAAAAAAAAAAAAAAGGTTGG'
    'AAAAAAAAAAAAAGGTTGGCCGG'
Update 1: Eliminated entirely unnecessary $count variable from final code example.
Update 2: Here's a slightly more elegant approach if you're not going to be extracting millions of k-mers (see List::MoreUtils::uniq):
    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
    "use List::MoreUtils qw(uniq);
     ;;
     my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
     ;;
     my $rx_kmer = qr{ .{21} GG }xms;
     ;;
     my @k_mers = uniq $seq =~ m{ (?= ($rx_kmer)) }xmsg;
     $#k_mers = 999 if $#k_mers >= 1000;
     ;;
     print qq{'$_'} for @k_mers;
    "
    'GGGGGGGGGGGGGGGGGGGGGGG'
    'AAAAAAAAAAAAAAAAAAAAAGG'
    'AAAAAAAAAAAAAAAAAGGTTGG'
    'AAAAAAAAAAAAAGGTTGGCCGG'
Update 3: If you also need the offset of each extracted k-mer subsequence, there are ways to do that, too.
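One such way, sketched here under my own assumptions rather than as the definitive answer: inside the match loop, the built-in @- array holds the start offset of each capture group, so $-[1] is the position at which the captured k-mer begins. The sub name kmers_with_offsets and its $limit parameter are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit unique k-mers (21 arbitrary
# bases followed by GG) as [offset, k-mer] pairs, in order of first
# appearance in $seq.
sub kmers_with_offsets {
    my ($seq, $limit) = @_;
    my $rx_kmer = qr{ .{21} GG }xms;
    my (@hits, %seen);
    while ($seq =~ m{ (?= ($rx_kmer)) }xmsg) {
        next if $seen{$1}++;           # skip k-mers already recorded
        push @hits, [ $-[1], $1 ];     # $-[1] is the start offset of capture 1
        last if @hits >= $limit;
    }
    return @hits;
}

# Test sequence built piecewise so the run lengths are explicit:
# 3 C's, 21 A's, then GGTTGGCCGGAAA.
my $seq = 'CCC' . ('A' x 21) . 'GGTTGGCCGGAAA';
printf "%3d '%s'\n", @$_ for kmers_with_offsets($seq, 1000);
```

With this test sequence the three k-mers should come back at offsets 3, 7 and 11. Only the first occurrence of a repeated k-mer gets its offset recorded; if every occurrence matters, drop the %seen check.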
Give a man a fish: <%-{-{-{-<
In reply to Re: program to look for specific K-mer sequence
by AnomalousMonk
in thread program to look for specific K-mer sequence
by pearllearner315