Re: RFC: A call to bioinformationalists for some generic information.

My old job was basically being the company BLAST-monkey, so I could probably give you some specific help in a personal conversation. However, I haven't run any BLAST searches in the last couple of years that I could share.

A couple of general points though:

99.9% (<-- pure guess based on extensive observations!) of BLAST searches are run with whatever defaults are set by the web portal or command line.
The speed of BLAST and related programs has been "fast enough" for many years now. So any improvements would need to come with "better" results (e.g. more accurate sequence alignments) to get the field excited.
The Hamming distance is not really applicable in this field (although it may have some use as an intermediate pass to pre-filter a database of sequences), as e-values are the cut-off most frequently used. However, this depends largely on the reason for the search, and on whether you are looking for evolutionarily related protein sequences or short stretches of highly similar DNA. For the latter, take a look at BLAT (https://genome.ucsc.edu/FAQ/FAQblat.html), which last time I checked was orders of magnitude faster than BLAST at this type of search.

If you want me to cobble together a test database or two with some general and some trickier edge-case examples, I could do that.

This is not a Signature...

Re^2: RFC: A call to bioinformationalists for some generic information.
by BrowserUk (Patriarch) on May 29, 2015 at 04:19 UTC

99.9% of BLAST searches are run with whatever defaults are set by the web-portal or command line.

I don't understand the significance of that statement.
I've looked at the NCBI web BLAST submit screen, and I wouldn't know where to start in order to submit a "typical" request; nor how to interpret whatever results I might receive.
What I'm working on is not a substitute for everything that BLASTx does. It might be incorporated into BLASTx (or a BLASTx replacement), but that would need to be done by people who understand the field.
My algorithm is purely concerned with a problem that has come up here many times over the last few years: searching a very long string over a limited alphabet for relatively short inputs (typically 15-32 characters), and finding all the match sites with a specified number of mismatches.
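
To pin down the problem statement, here is a brute-force reference scan in Python. It is not my algorithm (nor anything BLAST does), just the obvious baseline that any faster method has to agree with. The function name is made up for illustration, and I'm assuming "a specified number of mismatches" means "at most that many".

```python
def find_mismatch_sites(text, pattern, max_mismatches):
    """Return (offset, mismatch_count) for every window of `text` that differs
    from `pattern` in at most `max_mismatches` positions (substitutions only,
    no insertions or deletions)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches: abandon this window early
        else:
            # Inner loop finished without breaking, so the window qualifies.
            hits.append((start, mismatches))
    return hits
```

This visits every offset, so it cannot miss a site, but its cost grows with text length times pattern length, which is exactly what makes it impractical on genome-sized inputs.
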
The speed of BLAST and related programs has been "fast enough" now for many years. So any improvements would need to come with "better" results (e.g. more accurate sequence alignments) to get the field excited.

As I understand it, the way BLAST works is to build (or import a pre-built) index of short, fixed-sized exact matches -- typically minimum 7 for web-based searches -- and use that index to limit the number of positions at which exhaustive comparisons are made.
The downside of this approach is that for shorter inputs with higher numbers of mismatches, some potential sites are never examined.
I.e., if looking for a 25-base input with 4 mismatches, potential match sites where the 4 mismatches are evenly distributed through the 25 bases, e.g. ~....?....?....?....?.....~, will never be found, because none of the exactly matching stretches is as long as the index word size.
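
The 25-base example above can be checked with a few lines of Python. This is only a sketch of the argument, not of BLAST's real seeding code: all the names are hypothetical, the sequence is arbitrary, and the "seed" test is simplified to asking whether the query and the site share an exact run of at least the word size at the same alignment.

```python
def longest_exact_run(a, b):
    """Longest stretch of consecutive matching positions in two equal-length strings."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def guaranteed_exact_run(length, mismatches):
    """Pigeonhole bound: k mismatches leave at most k+1 exact runs, so the
    longest run is at least length // (k + 1), and can be exactly that short."""
    return length // (mismatches + 1)


def seed_reaches(query, site, word_size):
    """Simplified seed test: is there an aligned exact run of at least word_size?"""
    return longest_exact_run(query, site) >= word_size


def mutate(seq, positions):
    """Substitute a different base at each 0-based position in `positions`."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[c] if i in positions else c for i, c in enumerate(seq))


# Arbitrary 25-base query; mismatches laid out as ....?....?....?....?.....
query = "ACGTTGCAAGCTTAGCCATGGATCA"
site = mutate(query, {4, 9, 14, 19})
```

With these inputs `longest_exact_run(query, site)` is 5, the same as `guaranteed_exact_run(25, 4)`, so `seed_reaches(query, site, 7)` is false: a word size of 7 never gets a foothold on this site, while a word size of 5 would.
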
My algorithm does not suffer this limitation; it finds all potential match sites regardless of the number of mismatches.
Moreover, the ratio of mismatches does not affect the performance in any significant fashion.
It could (for example) find *all* the 9-base sites with 8 mismatches, or 12-base sites with 8, or 25-base sites with 8, in the same time; and very quickly.
The Hamming distance, is not really applicable in this field ..., as e-values are the cut-off most frequently used.

As I understand E-values, they are a function of the makeup of the sequence being searched and the subsequence being searched for.
They are a statistical measure of the likelihood of a "random match", given the makeup of the subsequence being sought and the sequence being searched.
As such, E-values are not affected by the search algorithm used; thus whatever filtering heuristics are currently applied would still need to be applied.
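
For reference, the usual textbook (Karlin-Altschul) form of the E-value, as far as I understand it, is E = K * m * n * exp(-lambda * S), where m is the query length, n the database length, S the raw alignment score, and K and lambda are constants derived from the scoring scheme and background composition. A minimal Python sketch follows; the function name is made up, and any values passed for `lam` and `k` are placeholders rather than real BLAST parameters.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul expectation: the number of alignments scoring at least
    `raw_score` expected by chance, E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)
```

Nothing in the formula refers to how the alignment was found: only the score, the two lengths and the scoring statistics go in. Doubling the database length doubles E for the same score.
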

What I'm getting from the similarities between your response to my request and a response I got to a request for information I emailed directly to the guys at the NCBI is that the real problem is not finding match sites, but rather filtering the mass of match sites found to eliminate the non-useful ones. And that is a process whose criteria I do not understand, and about which I have no insights to offer.

Indeed, I'm approaching the conclusion that, because my search algorithm would find *all* potential match sites, it might actually compound the filtering problem rather than help it.

So it looks like I may have a solution looking for a problem to solve on my hands.

Though I can't help but think that the potential for the "best" match site (however that might be assessed) being missed, because of the minimum index size (word-length), means that a lot of searches and pre- & post-filtering are being wasted.

I was hoping to have some basic performance numbers to post in this reply, but looking at my results a couple of hours ago I see an anomaly in the numbers coming out that I wasn't expecting, which could mean: a) my expectations were off; b) I've a bug in my code; c) the algorithm doesn't work.

I need to determine which of those is the case before I go posting "exciting numbers" that might be completely bogus.

Thank you for your reply. You've given me much to think about. If I get (back) to the point where I think I am ready to do comparisons, I'll /msg you.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
