comment on

Sadly I can't use BLAST for this case because BLAST does not account for "fuzzy matching". I'll explain that in a second, but in case anyone isn't familiar with it, let me describe BLAST a bit.

BLAST = Basic Local Alignment Search Tool. It is a way of solving the general problem of aligning two sequences. There is an optimal solution to the alignment of two sequences (a variation of dynamic programming, termed Smith-Waterman alignment). Unfortunately that solution is slow (O(n2), I believe). BLAST is much faster at aligning a single sequence against a database of sequences. What it does is search each member of the databases for short, exact matches from the query sequence. If it does not find these matches, it does not attempt full dynamic programming on that sequence. In other words, BLAST makes the assumption that two sequences with high similarity will have short subsequences of extremely high similarity. One of the big advantages of BLAST is that the statistics of BLAST searching can be estimated (it approximates an extreme-value distribution for the general case). This allows the assignation of a probability value to a similarity, and biologists like p-values.

The scoring system of BLAST is based on two things: a penalty for introducing a gap, and a scoring matrix that gives the +ve score for any given match and the -ve score for any given mismatch. This system is not flexible enough to do the kind of searching I am looking at for three reasons:

Most BLAST software cannot handle matching of N's (an N is an ambiguous base -- it could be any of the other four
No BLAST software can handle matching of all other ambiguity codes (e.g. Y for purines, R for pyrimidines, etc.)
BLAST can't handle "variable" gaps -- for instance one couldn't duplicate the search /GC.{10,13}AT/ in a BLAST

So, unfortunately this is a problem out of the scope of sequence alignments. Also, sequence alignment tools generally work well with large sequences, not with 10-20 letter ones. Some types of BLAST will actually be unable to perform alignments with sequences below 7 or 8 letters.

Sorry for the long explanation, but it's a really good idea and I wanted to explain why it wouldn't work for my situation

-Tats

In reply to Re: Re: Performance Tuning: Searching Long-Sequence Permutations by Itatsumaki
in thread Performance Tuning: Searching Long-Sequence Permutations by Itatsumaki

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.