comment on

Hello,

I used BioPerl, Bio::SeqIO to load the fasta sequences all into a long string. Then I tested for kmers that only appeared once and printed them out.

The results agree with your desired output.

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $in  = Bio::SeqIO->new( -file   => "dmel-all-chromosome-r6.18.fasta
+" ,
                           -format => 'fasta');
my $fasta;
while ( my $seq = $in->next_seq() ) {
    $fasta .= $seq->seq;
}

my %seen;

for my $i (0 .. length($fasta) - 21) {
    my $kmer =  substr $fasta, $i, 21;
    next unless substr($kmer, -2) eq 'GG';
    my $match = substr($kmer, -12);
    $seen{$match}{count}++;
    $seen{$match}{kmer} = $kmer;
}

my $crispr;
for my $key (keys %seen) {
    next unless $seen{$key}{count} == 1;
    print "crispr_", ++$crispr, "\n";
    print $seen{$key}{kmer}, "\n";    
}

__END__
*** ouput

crispr_1
TTTAGACTCCCCTTGTACAGG
crispr_2
TCTTCAGTCTCCAGTCTCCGG
crispr_3
TTGCGTTGCGGAGCATACTGG
crispr_4
TGCCACCAGTGGTTCCAAGGG
crispr_5
TTATGTTTGTACGAGGGGGGG
crispr_6
TCTCTTTGGTTTACGGATGGG
crispr_7
TTGGCAAGGAGACGGTCCTGG
crispr_8
TGAATTAAAGCTTGCGCGAGG
crispr_9
GGAAGAGGCATCAACGAGGGG
crispr_10
TGCAGCGGCCTAACAAGGCGG
crispr_11
CTGCCCGATCCTAACTCCAGG
crispr_12
ATATATGTTTGACCGTCGGGG
...
crispr_126892
TTGCTTGGCACTAAGGCGGGG
crispr_126893
CACCAAAAAGGACTTGCGTGG
crispr_126894
GTGCCCCTCACTCATGCGGGG
crispr_126895
TAAAAAGCGACGCAGTATTGG
crispr_126896
CCGGATTTCTTCGTACAGGGG
crispr_126897
GGTGGCTATGCTATGGTACGG
crispr_126898
CTGCGTTGATGTTAGGTAGGG
crispr_126899
GCTGGGACCCGAATACGTAGG
[download]

This program runs in under 2 minutes with the string of fasta characters about 145M. From your specs it looks like you want to combine the sequences into one string to test for kmers that only occur 1 time.

Update: Just realized how odd it was for my results to agree with the ones you posted. The order of my results were in the (un)ordered keys from the hash. It is curious why the order I got agreed with the order in your post!

Also, I think the reason you were getting 1 million results rather than 250,000 is you are testing for uniqueness on the whole 21 char windowsize instead of testing just the last 12 chars (including the 'GG' ending). There would be more unique 21 char kmers than unique 12 char kmers.

In reply to Re: unique sequences by Cristoforo
in thread unique sequences by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.