In addition to what Eily wrote, please bear in mind that this is PerlMonks and not BioMonks: what's a k-mer?

A consultation of Wikipedia turned up this. IIUC, the 25-base sequence  GGGGGGGGGGGGGGGGGGGGGGGGG has three overlapping 23-base k-mers ending in GG: the 23-base substrings starting at offsets 0, 1 and 2.
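
To check that reading of the definition, here is a minimal sketch that walks every 23-base window of the 25-base sequence with substr and keeps those ending in GG. The subroutine name windows_ending_in_gg is my own invention for illustration, not anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, k-mer] pairs for every window
# of length $k in $seq that ends in 'GG'.
sub windows_ending_in_gg {
    my ($seq, $k) = @_;
    my @hits;
    # A sequence of length L has L - k + 1 windows of length k.
    for my $offset (0 .. length($seq) - $k) {
        my $kmer = substr $seq, $offset, $k;
        push @hits, [$offset, $kmer] if $kmer =~ /GG\z/;
    }
    return @hits;
}

# The 25-base all-G example: windows at offsets 0, 1 and 2.
printf "%d: %s\n", @$_ for windows_ending_in_gg('G' x 25, 23);
```

This is the brute-force view of the problem; the regex below does the same job in one expression.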

The canonical regex approach to extracting overlapping patterns is this:

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"my $seq = 'CCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 my @k_mers = $seq =~ m{ (?= (.{21} GG)) }xmsg;
 print qq{'$_'} for @k_mers;
"
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'

You want only unique k-mers and only the first thousand. One approach (assuming you have already extracted the entire, contiguous base sequence from each FASTA record):

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 ;;
 my $rx_kmer = qr{ .{21} GG }xms;
 ;;
 my @k_mers;
 my %seen;
 ;;
 KMER:
 while ($seq =~ m{ (?= ($rx_kmer)) }xmsg) {
   push @k_mers, $1 unless $seen{$1}++;
   last KMER if @k_mers >= 1000;
   }
 ;;
 print qq{'$_'} for @k_mers;
"
'GGGGGGGGGGGGGGGGGGGGGGG'
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'
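
The assumption above, that each FASTA record has already been joined into one contiguous sequence, could be met along these lines. This is only a rough sketch: read_fasta is a hypothetical helper of mine, it assumes plain FASTA (header lines starting with '>', sequence lines following), and a module such as Bio::SeqIO from BioPerl may be a better choice for real data.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA text from a filehandle and return a
# list of [header, sequence] pairs, one per record, with the sequence
# lines of each record joined into a single string.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;               # strip newline and trailing whitespace
        next unless length $line;         # skip blank lines
        if ($line =~ /\A>(.*)/) {
            push @records, [$1, ''];      # header line starts a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;     # append to the current record's sequence
        }
    }
    return @records;
}

# Usage with an in-memory filehandle standing in for a real file.
my $fasta = ">rec1 demo\nCCCAAAA\nAAAAGG\n>rec2\nGGTT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
print "$_->[0]: $_->[1]\n" for read_fasta($fh);
```

Each returned sequence string can then be fed to the k-mer extraction shown above.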

Update 1: Eliminated entirely unnecessary  $count variable from final code example.

Update 2: Here's a slightly more elegant approach if you're not going to be extracting millions of k-mers (see List::MoreUtils::uniq):

c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le
"use List::MoreUtils qw(uniq);
 ;;
 my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA';
 ;;
 my $rx_kmer = qr{ .{21} GG }xms;
 ;;
 my @k_mers = uniq $seq =~ m{ (?= ($rx_kmer)) }xmsg;
 $#k_mers = 999 if $#k_mers >= 1000;
 ;;
 print qq{'$_'} for @k_mers;
"
'GGGGGGGGGGGGGGGGGGGGGGG'
'AAAAAAAAAAAAAAAAAAAAAGG'
'AAAAAAAAAAAAAAAAAGGTTGG'
'AAAAAAAAAAAAAGGTTGGCCGG'

Update 3: If you also need the offset of each extracted k-mer subsequence, there are ways to do that, too.
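
One such way, sketched here under the same "21 bases then GG" pattern used above, is to read the built-in @- array inside the match loop: $-[1] holds the offset at which capture group 1 begins. The subroutine name kmers_with_offsets is mine and purely illustrative.

```perl
use strict;
use warnings;

# Hypothetical helper: return [offset, k-mer] pairs for every
# overlapping 23-base window of $seq that ends in 'GG'.
sub kmers_with_offsets {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ m{ (?= (.{21} GG)) }xmsg) {
        # $-[1] is the 0-based offset where capture group 1 starts.
        push @hits, [ $-[1], $1 ];
    }
    return @hits;
}

# Same sequence as the first example, built up piecewise.
my $seq = 'CCC' . ('A' x 21) . 'GGTTGGCCGGAAA';
printf "%2d: '%s'\n", @$_ for kmers_with_offsets($seq);
```

Because the lookahead itself is zero-width, pos($seq) inside the loop should report the same offset; $-[1] just makes the link to the capture group explicit.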


Give a man a fish:  <%-{-{-{-<


In reply to Re: program to look for specific K-mer sequence by AnomalousMonk
in thread program to look for specific K-mer sequence by pearllearner315
