
99.9% of BLAST searches are run with whatever defaults are set by the web-portal or command line.

I don't understand the significance of that statement.
I've looked at the NCBI web BLAST submit screen, and I wouldn't know where to start in order to submit a "typical" request; nor how to interpret whatever results I might receive.
What I'm working on is not a substitute for everything that BLASTx does. It might be incorporated into BLASTx (or a BLASTx replacement), but that would need to be done by people who understand the field.
My algorithm is purely concerned with a problem that has come up here many times over the last few years: searching a very long string over a limited alphabet for relatively short inputs (typically 15-32 characters), and finding all the match sites with a specified number of mismatches.
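To pin down the problem statement (not my algorithm, which this is not), here is a naive brute-force reference in Python. It is only a sketch, assuming "a specified number of mismatches" means "at most that many substitutions, with no insertions or deletions"; the function name and the toy sequence are made up for illustration.

```python
def find_with_mismatches(text, pattern, max_mismatches):
    """Return (offset, mismatches) for every site in text where pattern
    matches with at most max_mismatches substitutions (no indels)."""
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break          # over budget: give up on this site
        else:
            # Loop finished without breaking: the site is within budget.
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    # Toy data, invented for this example.
    print(find_with_mismatches("ACGTACGTTTGA", "ACGA", 1))
```

This visits every offset, so it misses nothing, but its cost grows with text length times pattern length. It is useful mainly as a correctness baseline to compare a faster search against.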
The speed of BLAST and related programs has been "fast enough" now for many years. So any improvements would need to come with "better" results (e.g. more accurate sequence alignments) to get the field excited.

As I understand it, the way BLAST works is to build (or import a pre-built) index of short, fixed-size exact matches -- typically a minimum of 7 for web-based searches -- and use that index to limit the number of positions at which exhaustive comparisons are made.
The downside of the approach is that for shorter inputs with higher numbers of mismatches, some potential sites are never examined.
I.e. if looking for a 25-base input with 4 mismatches, potential match sites where the 4 mismatches are evenly distributed through the 25 bases, e.g. ~....?....?....?....?.....~, will never be found, because none of the exactly matching runs (at most 5 bases here) is as long as the index word size.
My algorithm does not suffer this limitation; it finds all potential match sites regardless of the number of mismatches.
Moreover, the ratio of mismatches does not affect the performance in any significant fashion.
It could, for example, find *all* the 9-base sites with 8 mismatches, or the 12-base sites with 8, or the 25-base sites with 8, in the same time; and very quickly.
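The seed-length limitation described above can be checked mechanically. This Python sketch assumes a mask notation like the one in my example ('.' for a matching base, '?' for a mismatch) and an index that only fires on an exact run of at least the word size; the function names are mine, not anything from BLAST.

```python
def longest_exact_run(mask):
    """Longest run of '.' (exact matches) in a mask where '?' is a mismatch."""
    return max(len(run) for run in mask.split("?"))


def seed_can_find(mask, word_size):
    """True if at least one exact run is long enough to hit the index."""
    return longest_exact_run(mask) >= word_size


def guaranteed_run(length, mismatches):
    """Pigeonhole bound: the longest exact run that must exist, however
    the mismatches are placed (ceiling of matches / (mismatches + 1))."""
    matches = length - mismatches
    return -(-matches // (mismatches + 1))


if __name__ == "__main__":
    mask = "....?....?....?....?....."   # 25 bases, 4 evenly spread mismatches
    print(longest_exact_run(mask), seed_can_find(mask, 7), guaranteed_run(25, 4))
```

By the pigeonhole bound, a 25-base input with 4 mismatches is only guaranteed an exact run of 5, so a word size of 7 cannot promise to see every such site.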
The Hamming distance, is not really applicable in this field ..., as e-values are the cut-off most frequently used.

As I understand E-values, they are a function of the makeup of the sequence being searched and the subsequence being searched for.
They are a statistical measure of the likelihood of a "random match", given the makeup of the subsequence being sought and the sequence being searched.
As such, E-values are not affected by the search algorithm used; thus whatever filtering heuristics are currently applied would still need to be applied.
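For reference, the usual Karlin-Altschul form of the E-value for an ungapped local alignment is E = K * m * n * exp(-lambda * S), where S is the raw score, m the query length and n the database length. The Python sketch below just evaluates that formula. The lambda and K values are placeholders chosen for illustration; in practice they depend on the scoring scheme and background composition, and none of the numbers here come from a real search.

```python
import math


def expected_chance_hits(raw_score, query_len, db_len, lam, k):
    """E = K * m * n * exp(-lambda * S): the expected number of chance
    alignments scoring at least raw_score in a search space of m * n."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    LAM, K = 1.374, 0.711   # placeholder parameters, assumed for illustration
    for score in (15, 20, 25):
        print(score, expected_chance_hits(score, 25, 1_000_000, LAM, K))
```

Nothing in the formula refers to how a site was found, only to its score and the size of the search space, which is consistent with the point that the cut-off is independent of the search algorithm.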

What I'm getting from the similarities between your response to my request and a response I got to a request for information I emailed directly to the guys at the NCBI is that the real problem is not finding match sites, but rather filtering the mass of match sites found to eliminate the non-useful ones. That is a process whose criteria I do not understand, and I have no insights to offer on it.

Indeed, I'm approaching the conclusion that, because my search algorithm would find *all* potential match sites, it might actually compound the filtering problem rather than help it.

So it looks like I may have a solution looking for a problem to solve on my hands.

Though I can't help but think that the potential for the "best" match site (however that might be assessed) being missed, because of the minimum index size (word-length), means that a lot of searches and pre- & post-filtering are being wasted.

I was hoping to have some basic performance numbers to post in this reply, but looking at my results a couple of hours ago I see an anomaly in the numbers coming out that I wasn't expecting, which could mean: a) my expectations were off; b) I've a bug in my code; c) the algorithm doesn't work.

I need to determine which of those is the case before I go posting "exciting numbers" that might be completely bogus.

Thank you for your reply. You've given me much to think about. If I get (back) to the point where I think I am ready to do comparisons, I'll /msg you.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this

In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to Re^2: RFC: A call to bioinformationalists for some generic information. by BrowserUk
in thread RFC: A call to bioinformationalists for some generic information. by BrowserUk
