Just in case Itatsumaki should happen back this way :)

Let's put this problem into a little perspective.

Using the regex engine

To use regex to test for all of:

  • An exact match.
  • Every possible 1-char mismatch.
  • Every possible 2-char mismatch.

Requires 1 + 25 (25C1) + 300 (25C2) = 326 different regexes.
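For illustration, here is a minimal sketch (my own code, not the OP's generator mentioned in the footnotes) of producing those 326 patterns for a single 25-er by wildcarding one, then two, positions:

my $word    = 'acgtacgtacgtacgtacgtacgta';   # an example 25-er
my @regexes = ( $word );                     # the exact match

for my $i ( 0 .. 24 ) {
    my $one = $word;
    substr( $one, $i, 1 ) = '.';             # every 1-char mismatch
    push @regexes, $one;

    for my $j ( $i + 1 .. 24 ) {
        my $two = $one;
        substr( $two, $j, 1 ) = '.';         # every 2-char mismatch
        push @regexes, $two;
    }
}

printf "%d patterns\n", scalar @regexes;     # 1 + 25 + 300 = 326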

Applying this mechanism to 1,000 25-ers, across 1,000 sequences of 1000-chars requires 326,000,000 calls to the regex engine[1].

If Perl could invoke the regex engine 10 times / millisecond[2], that would take just over 9 hours.

If you extend this to 100,000 25-ers, applied to 30,000 sequences of 1000-chars each, then the runtime will be just over 3 years!

If you double the number of 25-ers, you double the time.

If you double the average length of the strings, you slightly less than double the time.

However, if you allow for up to 3 mismatched characters, the number of regexes required goes from 326 to 2,626. That multiplies the time required by a factor of 8, bringing the total to just under 25 years.

Allow for 4 mismatches and you're in the region of 145 years.
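A back-of-envelope check of those figures (assuming a flat 10,000 regex calls per second across the 100,000 x 30,000 scenario):

my $calls_per_sec = 10_000;
for my $mis ( 2, 3, 4 ) {
    my( $regexes, $comb ) = ( 1, 1 );
    for my $k ( 1 .. $mis ) {
        $comb    *= ( 26 - $k ) / $k;        # 25 choose k
        $regexes += $comb;
    }
    my $calls = $regexes * 100_000 * 30_000;
    printf "%d mismatches: %5d regexes, ~%.0f years\n",
        $mis, $regexes, $calls / $calls_per_sec / ( 365.25 * 86_400 );
}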

Note: In this simplified scenario, no attempt has been made to record where the matches are found, nor which level of fuzziness matched.


Building an index of substrings

Start with the premise that every 1-miss match must contain at least a 12-character exact match. Using the same 100,000 x 30,000 x 1000 scenario from above, indexing all the 12-character substrings would require (1000 - 11) x 30,000 = 29,670,000 index entries, each containing the 12-byte substring + a 2-byte (minimum) sequence number + a 2-byte (minimum) offset, or 16 bytes per entry. That's about 450 MB of index, which should easily fit into memory on any halfway decent machine.

But, in its raw form, it doesn't buy you much. Even sorted, using a binary chop to look up your data is going to take up to 19 chops to locate each item. At, say, 10 microseconds per chop (way optimistic), this doesn't sound too bad: the 100,000 25-ers break up into 14 x 12-char substrings each, giving 14 * 100,000 * 19 = 26,600,000 chops, or 266 seconds spent on lookups.
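A rough sketch (my names, hypothetical filename) of what building and chopping that index might look like--noting that 29 million separate Perl strings cost far more memory than the raw 450 MB:

my @index;
open my $seq_fh, '<', 'sequences.txt' or die $!;
while( my $seq = <$seq_fh> ) {
    chomp $seq;
    for my $off ( 0 .. length( $seq ) - 12 ) {
        # 12-byte substring + 2-byte sequence number + 2-byte offset
        push @index, substr( $seq, $off, 12 ) . pack 'nn', $., $off;
    }
}
close $seq_fh;
@index = sort @index;

# Binary chop for one 12-char probe; returns ( seqno, offset ) or ().
sub lookup {
    my $probe = shift;
    my( $lo, $hi ) = ( 0, $#index );
    while( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $probe cmp substr( $index[ $mid ], 0, 12 );
        return unpack 'nn', substr( $index[ $mid ], 12 ) if $cmp == 0;
        $cmp < 0 ? ( $hi = $mid - 1 ) : ( $lo = $mid + 1 );
    }
    return;
}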

But what have we achieved? We may have succeeded in rejecting some of the sequences as not containing any of the 25-ers--but it is unlikely. The chance that a 1000-character string made up from A, C, G or T will contain any given 25-char substring is around 67 million times smaller than the chance that it will contain a given 12-char substring (a specific k-char string occurs at a given position with probability (1/4)^k, and 4^(25-12) = 4^13 is roughly 67 million).

That's difficult to comprehend, but by reducing the length of the substring from 25 to 12, you are 67 million times less likely to be able to reject a given 1000-char sequence on the basis of it not containing the 12-char substring! Reduce the substring to 7 or 8 chars (for the 2-miss match) and your effective rate of elimination drops through the floor.

And once we have located the index entry that corresponds to the 12-byte substring, we will still need to use the sequence number/offset information to go out and read a 38-byte substring from the sequence in order to verify the match!

 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | | | | | | | |?|X|X|X|X|X|X|X|X|X|X|X|X|?| | | | | | | | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

If the Xs represent our known 12-char match, the other 13 chars of our match could be at either end--hence the need to read 38 bytes from the sequence. But each of our 100,000 25-ers could fit around those 12 bytes in 14 different ways. That's fourteen different regexes we need to apply to that 38-byte substring (which needed to be read from a variable-length file!) for every one of our 29,000,000+ 12-byte substrings that matched.
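A sketch of that verification step; everything here is hypothetical scaffolding: $seq_no and $offset would come from the index lookup, @line_start is a prebuilt table of where each sequence starts in the file, and @alignments holds the 14 regexes for one 25-er positioned around the known 12 chars:

open my $seq_fh, '<:raw', 'sequences.txt' or die $!;
my @line_start = ( 0 );                      # byte offset of each sequence
my( $seq_no, $offset ) = ( 0, 412 );         # from the index lookup
my @alignments = ( qr/acgtacgtacgt.{13}/ );  # ...and 13 more alignments

my $start = $offset > 13 ? $offset - 13 : 0;
seek $seq_fh, $line_start[ $seq_no ] + $start, 0;
read $seq_fh, my $window, 38;                # 13 + 12 + 13 bytes around the hit

for my $re ( @alignments ) {
    if( $window =~ $re ) {
        print "verified in sequence $seq_no near offset $offset\n";
        last;
    }
}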

Sorry, but the math involves probabilities here and the whole lot got beyond my ability to keep straight.

The upshot is that for even a 1-miss match, the index has done little or nothing to reduce the work required, and in many ways has increased it substantially. By the time you get to 2-mismatch matches, the numbers become astronomical.

This option goes nowhere fast.


"Use a database!"

Indexing via a database (of any form) simply exacerbates the above problems. The size of the database required to index 30,000 x 1000-char sequences is huge. Building the indexes would take on the order of weeks, and each recall would be on the order of seconds. And with all that done, you still have to do most of the regex work yourself locally.

If anyone feels like contesting this conclusion, please show me the code--on realistically sized datasets. I may not be the world's greatest DBA, but my limited attempts to explore the possibilities of using a DB and SQL to achieve this indicate that it would be 2 or 3 orders of magnitude slower than the flat-files-and-regex solution above.

That is to say, I estimate that the 100,000 x 30,000 x 1000 x 1 mismatch scenario above would require 30 years of processing time. And that does not include the time required to build the DB.


AGREP and similar utilities

I spent an inordinate amount of time reading the help and playing with 2 different versions of agrep. My conclusion is that (as I indicated here), there is no way to utilise this tool for this purpose. It simply does not have any options to retrieve the sequence number/offset/fuzziness information required.

If anyone thinks they know better, I'd be all too pleased to see the working solution.


Stringy XOR

Applying the XOR method I describe elsewhere, the 1000 x 1000 x 1000 scenario I described takes 28.5 minutes.

[ 6:53:31.34] P:\test>406836.pl 406836.25s.1000 406836.seq.1000 >406836.results
Loaded 1000 25-ers at P:\test\406836.pl line 17.
Processing sequence  1000 offset 01238
Processed 1000 sequences at P:\test\406836.pl line 48, <SEQ> line 1000.
Average length: 1016.119 at P:\test\406836.pl line 49, <SEQ> line 1000.
Total fuzzy comparisons: 992119000 at P:\test\406836.pl line 50, <SEQ> line ...
[ 7:22:03.34] P:\test>

Code at end
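The heart of that code is the XOR-and-count idiom: XORing two equal-length strings gives "\0" wherever the characters agree, so counting the non-NUL bytes counts the mismatches directly. A stripped-down excerpt (the two example strings here are my own):

my $probe  = 'acgtacgtacgtacgtacgtacgta';   # a 25-er
my $window = 'acgtacgTacgtacgtacgtacgtt';   # a 25-char window from a sequence
my $fuzziness = 25 - ( ( $probe ^ $window ) =~ tr[\0][\0] );
print $fuzziness;                           # 2 mismatched positions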

That gives a time of 1.723 microseconds per 25x25 fuzzy comparison.

Applied to the 100,000 x 30,000 x 1,000 scenario, that would require 59 days, 9 hours.
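Scaling that measurement up (the test sequences averaged ~1016 chars, i.e. roughly 992 window positions each):

my $comparisons = 100_000 * 30_000 * 992;
printf "%.1f days\n", $comparisons * 1.723e-6 / 86_400;   # ~59.3 days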

That's just over 5% of the time required by (or 1,845% quicker than) the most optimistic estimate for the best of the above.

And the real trick is that if you increase the number of mis-matched characters to 3 or 10 or 20, the time required remains the same.

For the 3-mismatch scenario, that means 0.65 % of the runtime or 15,000 % faster.

For the 4-mismatch scenario, that means 0.11 % of the runtime or 89,000 % faster.

After that, the numbers begin to get silly :)

Further, I think that the XOR method could be improved. If the sequences and 25-ers are limited to the ACGT patterns, rather than containing other characters representing unknowns (Xs) and other letters representing proteins (?) and stuff, then you could pack 4 bases into each character and reduce the length of the strings by a factor of 4.
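A rough sketch of the packing idea (illustrative only; names are my own): map each base to 2 bits and pack 4 bases per byte.

my %bits = ( a => 0, c => 1, g => 2, t => 3 );

sub pack_bases {
    my $seq    = shift;
    my $packed = '';
    for( my $i = 0; $i < length $seq; $i += 4 ) {
        my $byte = 0;
        $byte = ( $byte << 2 ) | $bits{ $_ } for split //, substr $seq, $i, 4;
        $packed .= chr $byte;
    }
    return $packed;
}

printf "%d bytes\n", length pack_bases( 'acgt' x 6 );   # 24 bases -> 6 bytes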

The downside is that you have more work to determine the fuzziness factor. So, maybe not.


BLAST and similar tools.

I'm not really qualified to comment, as I do not understand the documentation for the BioPerl bundle. What I will say is that, given the depth and complexity of the hierarchy of modules, and the known overhead of method lookups combined with the less-than-sparkling performance of Perl's sub calling, I fail to see how anything in the bundle would come close to outperforming the regex engine and flat files.

Maybe some BioPerl aficionado will happen by and post a 20-line solution that outstrips everything here--but I installed the bundle a while ago, and I couldn't even work out where to start. Then again, I'm not a biochemist.


[1]Yes. This number of individual calls to the regex engine could be reduced by combining the regexes into one big regex using PreSuf or similar tools, but given the nature of the regexes involved, the complexity of the output from combining 326 regexes into one big regex is likely to slow you down rather than speed things up.

To see what I mean, try generating the 326 regexes required for the 2-mismatches in a 25-character ACGT string. The OP's post has code to do this. Then combine them into a single alternation (|) regex. Then run your favorite regex optimiser. Then time a few iterations of the result against a 1000-char ACGT sequence.

[2]This is very optimistic even for the simple cases:

timethis 10000, q[@m = ('acgt'x250 ) =~ m[(acgtacgtacgtacgtacgtacgta)]g ];
timethis 10000:  1 wallclock secs ( 1.16 usr +  0.00 sys =  1.16 CPU) @ 8643.04/s (n=10000)
#! perl -slw
use strict;
use bytes;

$| = 1;

our $FUZZY ||= 2;

open FUZ, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

my %fuz;
while( <FUZ> ) {
    chomp;
    $fuz{ $_ } = '';
}

close FUZ;

warn "Loaded ${ \scalar keys %fuz } 25-ers";

open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

my $totalLen = 0;
my $fuzzyComps = 0;
while( my $seq = <SEQ> ) {
    chomp $seq;
    $totalLen += length $seq;

    for my $offset ( 0 .. length( $seq ) - 25 ) {
        my $ssref = \substr( $seq, $offset, 25 );

        printf STDERR "\rProcessing sequence %5d offset %05d", $., $offset;

        for my $fuz ( keys %fuz ) {
            $fuzzyComps++;

            my $m = 25 - ( $fuz ^ $$ssref ) =~ tr[\0][\0];
            if( $m <= $FUZZY ) {
                ## This stores the lineno/offset/fuzziness where each 25-er matched
                ## in a compact form for further process; sorting etc.
                # $fuz{ $fuz } .= pack 'nnn', $., $offset, $m;

                ## Or just print out the data to a file.
                print "Matched '$fuz' -v- '", $$ssref,
                      "' in line: $. @ $offset with fuzziness: ", $m;
            }
        }
    }
}

warn "\n\nProcessed $. sequences";
warn "Average length: ", $totalLen / $.;
warn "Total fuzzy comparisons: ", $fuzzyComps;

close SEQ;

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki
