Re: Fuzzy Searching: Optimizing Algorithm Selection

by BrowserUk (Patriarch)
on Nov 12, 2004 at 02:33 UTC


in reply to Fuzzy Searching: Optimizing Algorithm Selection

Just in case Itatsumaki should happen back this way :)

Let's put this problem into a little perspective.

Using the regex engine

To use regex to test for all of:

  • An exact match.
  • Every possible 1-char mismatch.
  • Every possible 2-char mismatch.

Requires 1 (exact) + 25 (25C1) + 300 (25C2) = 326 different regexes.
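For concreteness, here is a minimal sketch of generating those patterns, assuming a hypothetical 25-char ACGT needle and the /./ style of wildcard (the OP's post contains his own generation code):

#! perl -slw
use strict;

my $needle  = 'ACGTACGTACGTACGTACGTACGTA';   ## hypothetical 25-er
my @regexes = ( qr/\Q$needle\E/ );           ## the exact match

## every possible 1-char mismatch: 25C1 = 25 patterns
for my $i ( 0 .. 24 ) {
    my $pat = $needle;
    substr( $pat, $i, 1 ) = '.';
    push @regexes, qr/$pat/;
}

## every possible 2-char mismatch: 25C2 = 300 patterns
for my $i ( 0 .. 23 ) {
    for my $j ( $i + 1 .. 24 ) {
        my $pat = $needle;
        substr( $pat, $i, 1 ) = '.';
        substr( $pat, $j, 1 ) = '.';
        push @regexes, qr/$pat/;
    }
}

print scalar @regexes;                       ## 326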

Applying this mechanism to 1,000 25-ers, across 1,000 sequences of 1000-chars requires 326,000,000 calls to the regex engine[1].

If Perl could invoke the regex engine 10 times / millisecond[2], that would take just over 9 hours.

If you extend this to 100,000 25-ers, applied to 30,000 sequences of 1000-chars each, then the runtime will be just over 3 years!

If you double the number of 25-ers, you double the time.

If you double the average length of the strings, you slightly less than double the time.

However, if you allow for up to 3 mismatched characters, the number of regexes required goes from 326 to 2,626, and you have multiplied the time required by a factor of 8, bringing the total time required to just under 25 years.

Allow for 4 mismatches and you're in the region of 145 years.

Note: In this simplified scenario, no attempt has been made to record where the matches are found, nor which level of fuzziness matched.


Building an index of substrings

Starting with the premise that every 1-miss match must contain at least a 12-character exact match, and using the same 100,000 x 30,000 x 1000 scenario from above: to index all the 12-character substrings would require (1000 - 11) = 989 x 30,000 index entries, each containing the 12-byte substring + 2-byte (min) sequence number + 2-byte (min) offset. That's 450 MB of index, which should easily fit into memory on any halfway decent machine.

But, in its raw form, it doesn't buy you much. Even sorted, using a binary chop to look up your data is going to take up to 19 chops to locate each item. At, say, 10 microseconds per chop (way optimistic) this doesn't sound too bad. 100,000 25-ers break up into 14 x 12-char substrings each, giving 14 * 100,000 * 19 = 266 seconds spent on lookup.
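A toy-scale sketch of that arrangement, assuming made-up sequences and the 12-byte substring + 2-byte sequence number + 2-byte offset packing described above:

#! perl -slw
use strict;

my @sequences = ( 'ACGTACGTACGTACGTACGTACGTACGTACGT',
                  'TTTTACGTACGTACGTCCCCGGGGAAAATTTT' );

## Build the index: one packed entry per 12-char substring.
my @index;
for my $s ( 0 .. $#sequences ) {
    for my $off ( 0 .. length( $sequences[ $s ] ) - 12 ) {
        push @index, pack 'a12 n n', substr( $sequences[ $s ], $off, 12 ), $s, $off;
    }
}
@index = sort @index;    ## lexicographic sort on the leading 12 bytes

## Binary chop for one 12-char probe; returns [ seq_no, offset ] pairs.
sub lookup {
    my $probe = shift;
    my( $lo, $hi ) = ( 0, $#index );
    while( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        substr( $index[ $mid ], 0, 12 ) lt $probe ? ( $lo = $mid + 1 ) : ( $hi = $mid );
    }
    my @hits;
    while( $lo <= $#index and substr( $index[ $lo ], 0, 12 ) eq $probe ) {
        push @hits, [ ( unpack 'a12 n n', $index[ $lo++ ] )[ 1, 2 ] ];
    }
    return @hits;
}

print "sequence $_->[0], offset $_->[1]" for lookup( 'ACGTACGTACGT' );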

But what have we achieved? We may have succeeded in rejecting some of the sequences as not containing any of the 25-ers--but it is unlikely. The chance that a 1000-character string made up from A, C, G or T will contain any given 25-char substring is around 67 million times smaller than the chance that it will contain a given 12-char substring.

That's difficult to comprehend, but by reducing the length of the substring from 25 to 12, you are 67 million times less likely to be able to reject a given 1000-char sequence on the basis of its not containing the 12-char substring! Reduce the substring to 7 or 8 chars (for the 2-miss match) and your effective rate of elimination drops through the floor.

And once we have located the index entry that corresponds to the 12-byte sequence, we will still need to use the sequence number/offset information to go out and read a 38 byte substring from the sequence in order to verify the match!

 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | | | | | | | | | | | |?|X|X|X|X|X|X|X|X|X|X|X|X|?| | | | | | | | | | | | |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

If the Xs represent our known matching 12 chars, the other 13 chars of our match could be at either end; hence the need to read 38 bytes from the sequence. But each of our 100,000 25-ers could fit around those 12 bytes in 14 different ways. That's fourteen different regexes we need to apply to that 38-byte substring (which needed to be read from a variable-length file!) for every one of our 29,000,000+ 12-byte substrings that matched.
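A sketch of that verification step for a single index hit, assuming a hypothetical 25-er, sequence and hit offset, and counting the mismatches with the XOR trick described below:

#! perl -slw
use strict;
use bytes;

my $needle  = 'ACGTACGTACGTACGTACGTACGTA';                  ## hypothetical 25-er
my $seq     = 'TTTTTTTTTTTTTACGTACGTACGTACGTACGTACGTATTTT'; ## hypothetical sequence
my $hit_off = 13;    ## where the index said one of the needle's 12-mers matched
my $FUZZY   = 2;

for my $k ( 0 .. 13 ) {                      ## position of the 12-mer within the 25-er
    my $start = $hit_off - $k;               ## implied start of the 25-er in the sequence
    next if $start < 0 or $start + 25 > length $seq;
    next unless substr( $needle, $k, 12 ) eq substr( $seq, $hit_off, 12 );

    my $m = 25 - ( ( $needle ^ substr( $seq, $start, 25 ) ) =~ tr[\0][\0] );
    print "25-er matches at offset $start with fuzziness $m" if $m <= $FUZZY;
}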

Sorry, but the math involves probabilities here and the whole lot got beyond my ability to keep straight.

The upshot is that for even a 1-miss match, the index has done little or nothing to reduce the work required, and in many ways has increased it substantially. By the time you get to 2-missed-char matches, the numbers become astronomical.

This option goes nowhere fast.


"Use a database!"

Indexing via a database (of any form) simply exacerbates the above problems. The size of the database required to index 30,000 x 1000-char sequences is huge. Building the indexes would take on the order of weeks, and each retrieval would be on the order of seconds. And with all that done, you still have to do most of the regex work yourself locally.

If anyone feels like contesting this conclusion, please show me the code--on realistically sized datasets. I may not be the world's greatest DBA, but my limited attempts to explore the possibilities of using a DB and SQL to achieve this indicate that it would be 2 or 3 orders of magnitude slower than the flat files and regex solution above.

That is to say, I estimate that the 100,000 x 30,000 x 1000 x 1 mismatch scenario above would require 30 years of processing time. And that does not include the time required to build the DB.


AGREP and similar utilities

I spent an inordinate amount of time reading the help and playing with 2 different versions of agrep. My conclusion is that (as I indicated here), there is no way to utilise this tool for this purpose. It simply does not have any options to retrieve the sequence number/offset/fuzziness information required.

If anyone thinks they know better, I'd be all too pleased to see the working solution.


Stringy XOR

Applying the XOR method I describe elsewhere, the 1000 x 1000 x 1000 scenario I described takes 28.5 minutes.

[ 6:53:31.34] P:\test>406836.pl 406836.25s.1000 406836.seq.1000 >406836.results
Loaded 1000 25-ers at P:\test\406836.pl line 17.
Processing sequence  1000 offset 01238
Processed 1000 sequences at P:\test\406836.pl line 48, <SEQ> line 1000.
Average length: 1016.119 at P:\test\406836.pl line 49, <SEQ> line 1000.
Total fuzzy comparisons: 992119000 at P:\test\406836.pl line 50, <SEQ> line ...

[ 7:22:03.34] P:\test>

Code at end

That gives a time of 1.723 microseconds per 25x25 fuzzy comparison.
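For anyone who doesn't want to wade through the full script at the end, the core of each of those comparisons is just a string XOR and a tr/// count (my restatement; the script writes it as 25 minus the count of NULs, which is the same thing):

#! perl -slw
use strict;
use bytes;

my $needle    = 'ACGTACGTACGTACGTACGTACGTA';
my $candidate = 'ACGTACGTACTTACGTACGTCCGTA';

## XORing two equal-length strings gives "\0" wherever the characters agree,
## so counting the non-NUL bytes counts the mismatched positions directly.
my $fuzziness = ( $needle ^ $candidate ) =~ tr/\0//c;
print "$fuzziness mismatches";    ## prints "2 mismatches"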

Applied to the 100,000 x 30,000 x 1,000 scenario, that would require 59 days, 9 hours.

That's just over 5% of the time (or 1,845% quicker) than the most optimistic estimate for the best of the above.

And the real trick is that if you increase the number of mis-matched characters to 3 or 10 or 20, the time required remains the same.

For the 3-mismatch scenario, that means 0.65 % of the runtime or 15,000 % faster.

For the 4-mismatch scenario, that means 0.11 % of the runtime or 89,000 % faster.

After that, the numbers begin to get silly :)

Further, I think that the XOR method could be improved. If the sequences and 25-ers are limited to the ACGT patterns, rather than containing other characters representing unknowns (Xs) and other letters representing proteins(?) and stuff, then you could pack 4 bases into each character and reduce the length of the strings by a factor of 4.
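A sketch of that packing idea, assuming plain ACGT data and a hypothetical pack_bases() helper:

#! perl -slw
use strict;
use bytes;

my %code = ( A => 0, C => 1, G => 2, T => 3 );

## Pack 2 bits per base, 4 bases per byte; short tails are padded with A.
sub pack_bases {
    my $seq = shift;
    my $packed = '';
    for( my $i = 0; $i < length $seq; $i += 4 ) {
        my $byte = 0;
        $byte = ( $byte << 2 ) | $code{ substr( $seq . 'AAAA', $i + $_, 1 ) } for 0 .. 3;
        $packed .= chr $byte;
    }
    return $packed;
}

my $packed_a = pack_bases( 'ACGTACGTACGTACGTACGTACGTA' );
my $packed_b = pack_bases( 'ACGTACGTACTTACGTACGTCCGTA' );

## XOR still works on the packed form, but a non-zero byte only tells you that
## at least one of its 4 bases differs -- hence the extra work, mentioned above,
## to recover the true fuzziness.
my $differing_bytes = ( $packed_a ^ $packed_b ) =~ tr/\0//c;
print "$differing_bytes of the 7 packed bytes differ";    ## prints 2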

The downside is that you have more work to determine the fuzziness factor. So, maybe not.


BLAST and similar tools.

I'm not really qualified to comment, as I do not understand the documentation for the BioPerl bundle. What I will say is that given the depth and complexity of the hierarchy of modules, the known overhead of method lookups, and the less than sparkling performance of Perl's sub calling, I fail to see how anything in the bundle would come close to outperforming the regex engine and flat files.

Maybe some BioPerl aficionado will happen by and post a 20-line solution that outstrips everything here--but I installed the bundle a while ago, and I couldn't even work out where to start. Then again, I'm not a biochemist.


[1]Yes. This number of individual calls to the regex engine could be reduced by combining the regexes into one big regex using PreSuf or similar tools, but given the nature of the regexes involved, the complexity of the output from combining 326 regexes into one big regex is likely to slow you down rather than speed things up.

To see what I mean, try generating the 326 regexes required for the 2-mismatches in a 25-character ACGT string. The OP's post has code to do this. Then combine them into a single alternation (|) regex. Then run your favorite regex optimiser. Then time a few iterations of the result against a 1000-char ACGT sequence.
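The naive combination step, for anyone who wants to try that experiment, is little more than a join (a sketch, assuming just the 300 two-wildcard patterns for one hypothetical needle):

#! perl -slw
use strict;

my $needle = 'ACGTACGTACGTACGTACGTACGTA';

my %pats;
for my $i ( 0 .. 23 ) {
    for my $j ( $i + 1 .. 24 ) {
        my $pat = $needle;
        substr( $pat, $i, 1 ) = '.';
        substr( $pat, $j, 1 ) = '.';
        $pats{ $pat } = 1;
    }
}

my $re = do { my $big = join '|', sort keys %pats; qr/$big/ };    ## one 300-way alternation

my $seq = join( '', map { (qw( A C G T ))[ rand 4 ] } 1 .. 1000 )
        . 'ACGTACGTACTTACGTACGTCCGTA';    ## splice in a known 2-miss variant
print "sequence matched" if $seq =~ $re;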

[2]This is very optimistic even for the simple cases:

timethis 10000, q[ @m = ( 'acgt' x 250 ) =~ m[(acgtacgtacgtacgtacgtacgta)]g ];

timethis 10000:  1 wallclock secs ( 1.16 usr + 0.00 sys = 1.16 CPU) @ 8643.04/s (n=10000)
#! perl -slw
use strict;
use bytes;

$| = 1;

our $FUZZY ||= 2;

open FUZ, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

my %fuz;
while( <FUZ> ) {
    chomp;
    $fuz{ $_ } = '';
}
close FUZ;

warn "Loaded ${ \scalar keys %fuz } 25-ers";

open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

my $totalLen = 0;
my $fuzzyComps = 0;
while( my $seq = <SEQ> ) {
    chomp $seq;
    $totalLen += length $seq;

    for my $offset ( 0 .. length( $seq ) - 25 ) {
        my $ssref = \substr( $seq, $offset, 25 );

        printf STDERR "\rProcessing sequence %5d offset %05d", $., $offset;

        for my $fuz ( keys %fuz ) {
            $fuzzyComps++;

            my $m = 25 - ( $fuz ^ $$ssref ) =~ tr[\0][\0];

            if( $m <= $FUZZY ) {
                ## This stores the lineno/offset/fuzziness where each 25-er matched
                ## in a compact form for further process; sorting etc.
                # $fuz{ $fuz } .= pack 'nnn', $., $offset, $m;

                ## Or just print out the data to a file.
                print "Matched '$fuz' -v- '", $$ssref,
                      "' in line: $. @ $offset with fuzziness: ", $m;
            }
        }
    }
}

warn "\n\nProcessed $. sequences";
warn "Average length: ", $totalLen / $.;
warn "Total fuzzy comparisons: ", $fuzzyComps;

close SEQ;

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by hv (Prior) on Nov 12, 2004 at 11:48 UTC

    If you double the number of 25-ers, you double the time.

    Yes.

    If you double the average length of the strings, you slightly less than double the time.

    No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string.

    Of course there are two ways of allowing for fuzziness: replace /x/ with /[^x]/, or replace it with /./; the latter approach means you don't also need the 0- and 1-mismatch patterns, leaving 300 patterns (for length-25, up to 2 errors) or 1225 (for length-50).

    Hugo

      No: 1 + 50C1 + 50C2 = 1 + 50 + 1225 = 1276, nearly 4 times the 326 different regexps required for a length-25 string.

      Sorry, I was lax in my language. Instead of "...length of the strings...", I should have said sequences. That is, if you double the length of the sequences being searched, from 1000 to 2000 chars (average), but search for the same 25-char needle, then the time taken is slightly less than double.

      I agree if the length of the needles double, the number of regexes required close to quadruples.

      As I understand the problem, using /./ for the fuzziness is the right thing. However, whilst a regex with 2 wildcards will match anything that the same regex with only one wildcard would match, the problem with discarding the latter is that you will no longer be able to record how fuzzy the match was.

      Other than perhaps capturing the wildcards and then using substr to determine which of the wildcards would have matched had they not been wild. Between the slowdown from capturing and the additional work of determining the fuzz factor, it is probably quicker to retain the 1-miss and 2-miss regexes?
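      A minimal sketch of that capture-and-substr check, assuming a hypothetical needle, one of its 300 two-wildcard patterns, and a sequence containing a 2-miss variant:

      #! perl -slw
      use strict;

      my $needle = 'ACGTACGTACGTACGTACGTACGTA';          ## hypothetical 25-char needle
      my $seq    = 'TTTT' . 'ACGTACGTACTTACGTACGTCCGTA' . 'GGGG';

      ## one of the 300 2-wildcard patterns: wildcards at (0-based) positions 10 and 20
      my @wild = ( 10, 20 );
      my $pat  = join '',
                 substr( $needle,  0, 10 ), '(.)',
                 substr( $needle, 11,  9 ), '(.)',
                 substr( $needle, 21,  4 );

      if( $seq =~ /$pat/ ) {
          my @got  = ( $1, $2 );
          my $fuzz = grep { $got[ $_ ] ne substr( $needle, $wild[ $_ ], 1 ) } 0 .. $#wild;
          print "matched at offset $-[0] with fuzziness $fuzz";
      }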


Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by husker (Chaplain) on Nov 12, 2004 at 15:02 UTC
    Fantastic analysis!

    I think it's interesting that the XOR method, which appears to be the most efficient, and widens its lead as the problem space expands, corresponds more closely to how nature itself "searches" and "parses" amino acid strings than any of the other methods. Our sensibilities ought to help us realize that if millions of years of evolution have selected a solution to a problem, there must be something useful there.

Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by demerphq (Chancellor) on Nov 12, 2004 at 19:57 UTC

    Well BrowserUk, I just finished some testing of my approach. I was able to search a dictionary of 30 million words for one of 1245 possible fuzzy matches in 30 minutes on a 555 MHz PC. Admittedly this was anchored at the first char, not a match "in string" as it were.

    Considering your extremely harsh comments in the CB, I'd be very happy to run whatever code you choose against my code to see which is faster. The code takes as an argument a string of digits and then finds all 0-, 1- or 2-char mismatches of that string in the dictionary with the correct number of digits. If you provide code so that I can do the equivalent of

    my $match=$trie->fetch($num);

    Then I'll run your code against mine with the same test set, etc., and we will see whose solution is faster.

    Since my solution represents a special case of the general problem, yours should perform even better than you have suggested here, so I expect that you'll win. But until you have, don't you think you should keep your comments at least polite?

    ---
    demerphq

      Why tries won't work for this application.

      For 25-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's

      1. 0-miss (exact.)

        = 1

      2. 1-miss

        (CcM) * (A-1)^M = 25C1 * (4-1)^1 = 25 * 3 = 75.

      3. 2-miss

        (CcM) * (A-1)^M = 25C2 * (4-1)^2 = 300 * 9 = 2700.

      Total 2776 keys. No problem building or with the performance of a Trie with 2776 keys.

      Although the implementations currently on CPAN chew memory at a prodigious rate--a C implementation would be much less memory hungry as well as faster.

      The problems.

      To simplify the description of the problems, I'll start with using a 4-character key from an alphabet of ACGT.

      Further, we'll take the OPs description and look for matches with 0, 1 or 2 mismatched characters.

      For 4-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's

      1. 0-miss (exact)

        = 1

      2. 1-miss

        (CcM) * (A-1)^M = 4C1 * (4-1)^1 = 4 * 3 = 12

      3. 2-miss

        (CcM) * (A-1)^M = 4C2 * (4-1)^2 = 6 * 9 = 54.

      Total 67 keys.

      Picking one of the 256 possible keys, I'll use 'ACGT'.

      The keys we need to put into our trie in order to fuzzy match 'ACGT' are:

      ## 0-miss (exact) match: 1 unique
      ACGT

      ## 1-miss match: 4*4 = 16 - 4 duplicates of above = 12 unique
      ## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
      ## a key resulted from multiple sources.
      ## The vs indicate the columns varying.
      v         v         v         v
      ---------------------------------
      ACGT°²   AAGT²    ACAT²    ACGA²
      CCGT²    ACGT°²   ACCT²    ACGC²
      GCGT²    AGGT²    ACGT°²   ACGG²
      TCGT²    ATGT²    ACTT²    ACGT°²

      ## 2-miss match: 6*4*4 = 96 -
      ##   ( 6 duplicates of 0-miss + ( 3 each duplicates * 12 1-miss ) )
      ## With tags (°=exact; ¹=1-miss match; ²=2-miss match) showing where
      ## a key resulted from multiple sources.
      ## = 54 unique.
      vv       v v      v  v      vv       v v       vv
      -------------------------------------------------
      AAGT¹    ACAT¹    ACGA¹    AAAT     AAGA     ACAA
      CAGT     CCAT     CCGA     ACAT¹²   ACGA¹²   ACCA
      GAGT     GCAT     GCGA     AGAT     AGGA     ACGA¹²
      TAGT     TCAT     TCGA     ATAT     ATGA     ACTA
      ACGT°    ACCT¹    ACGC¹    AACT     AAGC     ACAC
      CCGT¹    CCCT     CCGC     ACCT¹²   ACGC¹²   ACCC
      GCGT¹    GCCT     GCGC     AGCT     AGGC     ACGC¹²
      TCGT¹    TCCT     TCGC     ATCT     ATGC     ACTC
      AGGT¹    ACGT°²   ACGG¹    AAGT¹²   AAGG     ACAG
      CGGT     CCGT¹²   CCGG     ACGT°²   ACGG¹²   ACCG
      GGGT     GCGT¹²   GCGG     AGGT¹²   AGGG     ACGG¹²
      TGGT     TCGT¹²   TCGG     ATGT¹    ATGG     ACTG
      ATGT¹²   ACTT¹    ACGT°²   AATT     AAGT¹²   ACAT¹²
      CTGT     CCTT     CCGT¹²   ACTT¹²   ACGT°²   ACCT¹²
      GTGT     GCTT     GCGT¹²   AGTT     AGGT¹²   ACGT°²
      TTGT     TCTT     TCGT¹²   ATTT     ATGT¹²   ACTT¹²

      ## Giving us our 67 unique keys.

      Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates.

      Indeed, noticing that the brute-force, 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple process to skip those generation steps and simply de-dup the 2-miss set.

      ## 2-miss match generation with duplicates removed
      ## With tags (°=exact, ¹=1-miss match) showing where
      ## a key resulted from multiple sources.
      vv       v v      v  v      vv       v v       vv
      -------------------------------------------------
      AAGT¹    ACAT¹    ACGA¹    AAAT     AAGA     ACAA
      CAGT     CCAT     CCGA                       ACCA
      GAGT     GCAT     GCGA     AGAT     AGGA
      TAGT     TCAT     TCGA     ATAT     ATGA     ACTA
      ACGT°    ACCT¹    ACGC¹    AACT     AAGC     ACAC
      CCGT¹    CCCT     CCGC                       ACCC
      GCGT¹    GCCT     GCGC     AGCT     AGGC
      TCGT¹    TCCT     TCGC     ATCT     ATGC     ACTC
      AGGT¹             ACGG¹             AAGG     ACAG
      CGGT              CCGG                       ACCG
      GGGT              GCGG              AGGG
      TGGT              TCGG     ATGT¹    ATGG     ACTG
               ACTT¹             AATT
      CTGT     CCTT
      GTGT     GCTT              AGTT
      TTGT     TCTT              ATTT

      In fact, provided your Trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.
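      A minimal sketch of that behaviour, assuming a plain hash-of-hashes trie with hypothetical trie_insert()/trie_lookup() helpers rather than any particular CPAN module:

      #! perl -slw
      use strict;

      sub trie_insert {
          my( $trie, $key ) = @_;
          my $node = $trie;
          $node = $node->{ $_ } ||= {} for split //, $key;
          $node->{ '*' } = 1;                     ## mark an accepting node
      }

      sub trie_lookup {
          my( $trie, $key ) = @_;
          my $node = $trie;
          for my $char ( split //, $key ) {
              return 0 unless $node = $node->{ $char };
          }
          return $node->{ '*' } ? 1 : 0;
      }

      my %trie;
      ## insert the same fuzzy key from several "sources"; the repeats are silently absorbed
      trie_insert( \%trie, $_ ) for qw( ACGT ACGT AAGT ACGT ACTT );

      print "AAGT: ", trie_lookup( \%trie, 'AAGT' ) ? 'matches' : 'no match';
      print "AGGT: ", trie_lookup( \%trie, 'AGGT' ) ? 'matches' : 'no match';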

      And there's the rub

      Once you start storing multiple sets of overlapping data into your Trie in order to make best use of its performance, when you get a match you can no longer distinguish whether this match occurred because it was an exact match, a 1-miss match or a 2-miss match.

      So, even if you generate 1 Trie per key and apply them to each of the 976 25-char substrings in a 1000-char sequence 1 at a time (which is necessary using a Trie, and has the beneficial side-effect of allowing you to know at what position within the sequence the match occurred--which is a requirement), you still don't know why it matched.

      Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source.

      So, whilst Tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like regex, they don't tell you what they matched.

      With a regex, you can use the capture mechanism to discover this. Whilst this will slow the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string, locate every match, and record where it found them (@- & @+), so the number of calls to the regex engine is 1 per sequence per fuzzy regex.
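      A minimal sketch of that one-call-per-sequence-per-regex approach, assuming a hypothetical two-wildcard pattern and using @- to recover every offset at which it matched:

      #! perl -slw
      use strict;

      my $seq = 'TTTT' . 'ACGTACGTACTTACGTACGTCCGTA'     ## a 2-miss variant at offset 4
              . 'GG'   . 'ACGTACGTACGTACGTACGTACGTA'     ## the exact 25-er at offset 31
              . 'TT';

      my $fuzzy = qr/ACGTACGTAC(.)TACGTACGT(.)CGTA/;     ## one 2-wildcard variant of the 25-er

      while( $seq =~ /$fuzzy/g ) {
          ## @- holds the start offsets of the whole match and of each capture
          print "match at offset $-[0], wildcards matched '$1' and '$2'";
          pos( $seq ) = $-[0] + 1;    ## step back so overlapping matches aren't missed
      }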

      MegaTrie

      Now it is possible to combine the keys generated from more than one exact key, into a single Trie. At least in theory, this would allow the 100,000 exact keys + all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by solving all 100,000 * 2700 (2-miss) "Does it match?" questions, in one go, for each 25-char substring. Reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.

      The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So, you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).

      Fast pre-elimination?

      At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, because the (Mega)Trie needs to be applied individually to each substring--unlike a regex, for example--and a Trie lookup involves either recursive or iterative sub-calls, it is slower than just using the XOR alone.

      That means that there is no saving to be had using a Trie--even for this limited purpose.

      Deterministic Finite-state Automaton

      The ultimate expression of the Trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like data structure and a loop to apply it successively down the length of a string, substring-by-substring, reporting the position(s) of the match(es) found. Much like the regex engine does, but failing early and moving on rather than backtracking*.

      *It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.
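      A tiny illustration of the difference the cut operator makes, assuming a deliberately contrived pattern:

      #! perl -slw
      use strict;

      my $string = 'aaaa';

      ## Plain group: a+ grabs all four a's, then backtracks to give one up so the
      ## trailing 'a' can match -- the overall match succeeds.
      print 'backtracking: ', $string =~ /^a+a$/     ? 'matched' : 'no match';

      ## Atomic group: a+ grabs all four a's and may not give any back, so there is
      ## nothing left for the trailing 'a' and the whole match fails immediately.
      print 'atomic:       ', $string =~ /^(?>a+)a$/ ? 'matched' : 'no match';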

      This would reduce the number of lookups (again defined as a transition from Perl into C and back) to 30,000. You still wouldn't know what matched, but the speed-up would offset the cost of further elimination.

      The problems include:

      • All the DFA modules I looked at on CPAN used hash-based objects to represent the DFA states, and as such would require huge amounts of memory to capture the complexity of this problem.
      • Writing a C-based DFA is non-trivial for a few, known states.

        But this problem involves a huge number of states, and those states are going to vary from run to run. So the task is not writing a single DFA, but that of writing a DFA generator. This is an order of magnitude more complex to do.

      • The sheer volume of states involved means that it would be no good writing a poor implementation of a DFA and relying upon the natural speed of C to see you through.

        Not only would it need to be a fairly efficiently coded DFA, it would also need to be highly optimised in its use of memory (notoriously difficult) in order that the huge number of states would fit into memory concurrently.

      • It would take a very highly skilled (C) programmer to write a DFA for this task that would out-perform the Perl regex engine and some suitably carefully crafted regex (using the cut operator).

        The regex engine has had a lot of very skilled programmers working on it for a very long time. If the OP is up to this task, he is in the wrong job as a biochemist (Assumption!).

        The gulf between what is theoretically possible and what is practical is very wide.

      The point of all this?

      Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it.

      Secondly, the next time someone calls you to question over something you have posted,

      1. Don't get so upset by the form of the message that you ignore the content.

        This is especially true of a private message between old sparring partners.

        Even if it says "Utter bollocks", which is not rude, but a colloquialism for "Utter rubbish" or "Completely in error".

      2. Don't assume that you must be right and the other guy must be wrong.

        You might be in error.

        If it was worth posting, it is always worth a second (cool) assessment before you defend your post to the hilt.

      3. Don't conflate the matter to which the message relates, with other, older issues, long dead.
      4. Don't assume that because you have used the methods in question successfully on some other project, that the same methods will be applicable to the problem at hand.

        Even if they sound remarkably similar at first reading.

      5. Don't assume that your understanding of the CS issues involved is superior to the other guy's understanding, no matter what history between you dictates.

        He may well have learnt from prior encounters and/or further learning in the interim.

      6. Don't expect the other guy to accept your statistics as proof.

        Especially when those statistics are produced using code you cannot show, running on data that is completely unrelated to the problem at hand.

      7. Don't expect the other guy to produce code to validate your argument.

        Especially when he has already published analysis, code, and verifiable statistics.

        Especially when the code you are asking him to produce would have to run against a system that you cannot show him.

        Especially when that system uses data that is completely unrelated to the problem under discussion.

        Especially when the information that system produces is completely inadequate for solving the problem at hand.

      8. Don't think that you can never make a mistake.

        There is nothing wrong with being human.

        But, if the result of your mistake could lead others into considerable time and effort engaging in fruitless explorations, based on your mistake, do make sure that you clearly identify them when you realise.

      FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word) as you well know.

      Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:

      • Raising the ante (by going public);
      • Engaging him in extended bouts of discussion that would be unnecessary or simplified by that clarification.
      • Firing back several, extended batches of dismissals, prevarications and "fuck off you dickhead"s at him.

      Thirdly, maybe a view through the "other guy's eyes" of yesterday's conflagration will persuade you to review your final act in our off-post communications?



        Well BrowserUk, I'm afraid the numbers just don't agree with you at all.

        I put together a hybrid Trie/DFA solution (code is in the readmore at the bottom of this node). It shows quite clearly that the Trie/DFA will win in many cases against your XOR approach. Even though my trie implementation is pure Perl (with an extremely inefficient DFA construction routine), it outperforms your algorithm in every circumstance I tested it in _except_ the 25/1000/1000 tests that you were using. And even then, when I make the string being searched just 10 times larger, XOR loses outright. Also, any possible benefit that may accrue from OS-level file caching goes to the XOR algorithm, because the Trie algorithm most often finished first and I put it first in the test code. I didn't have sufficient tuits to redo the implementation using hybrid Perl/Inline::C, which would have just increased the difference between the routines.

        Basically the performance for the Trie/DFA is nonlinear for construction, and linear for the actual scan. This of course means that once constructed, the search time for any given piece of data will theoretically be some constant times the length of the string, and in practice we see pretty much that, with some modest variation probably due to memory-management costs. Alternate and more compact representations could be chosen to make this more efficient. Since the constructed Trie can be serialized for reuse, the construction costs could in many situations be essentially ignored. For this reason I separated out the time actually running the state machine from the time constructing it.

        I will admit that there are two subtle differences between the results from the two approaches. Yours will show multiple hits for a single offset if they exist, whereas mine will only show one. I don't consider this in any way to make your solution superior, as it is a fairly trivial issue to resolve: one could simply maintain a hash of all the keys in the trie and any other keys that are anagrams of that key. Similarly, this particular implementation doesn't necessarily handle strings of different sizes as one might expect. As this is meant to be a proof of concept, and the implementation you posted is hardcoded to handle 25-char words and as such suffers the same problem, I didn't see that as a show stopper.

        I should say lastly that calling this a DFA may not be entirely correct. Possibly a better term would be FSA (finite state automaton). The reason being that normally neither NFA nor DFA regex engines will match overlapping words. I.e. if the string is "ABCABD" and we have 'BC' and 'CA' as accepting strings then it would match ONLY "BC". My implementation doesn't care about that and would match both. Regardless, there are any number of ways to build a state machine for the purpose at hand, and mine is probably one of the least efficient.

        The following table shows the various test scenarios that I used. (All times are in seconds.) The last column shows how much longer the XOR solution took than the Trie solution. Where the number is negative (and in bold) is where the XOR-based solution beats the Trie-based solution. You'll notice that only one scenario is in this category: that being when the strings are 1000 chars and the words are 25 chars. A ratio of 1:40. However, once we use a ratio of 1:100 or so, this reverses itself quite dramatically.

        The bottom line here is that if you have the resources, long term you are going to be better off with a technique like this than a technique like yours. For something like searching vast quantities of DNA strings I imagine a technique like this would do quite well. OTOH this is a trade-off. I can imagine scenarios where your approach would be better. But for hardcore pattern matching on modern machines with lots of RAM, I'd be looking at a DFA. Notice that by the time we start talking about 100k being searched for 10 words, the trie runs in 26 minutes less. And this is pure Perl and, in the opinion of the author, a pig. :-)


        A footnote: you mentioned a number of things that a DFA/Trie is supposedly not able to do. I suggest to you that you were mostly wrong. Adding meta data to states in a DFA is relatively easy to do. For instance, accepting states can have associated data (in fact, in the implementation above it's the very presence of that data that indicates it's an accepting state).

        ---
        demerphq

        Since all that a DFA has to keep track of is the position within the test string and the number of fuzzy matches made so far, it would need no more than (number of characters in the test string)*(number of fuzzies allowed + 1) states. That is only 75 states for a string of length 25 with two errors allowed.

        I haven't coded the DFA yet, but here is looping code that illustrates the idea.

      demerphq You are wrong.

      How long were your "words"? Less than 25 characters?

      For the two-miss scenario, matching each of 100,000 25-character needles against each of 30,000 x 1000-char strings requires:

      326 * 100,000 * 976 * 30,000 comparisons = 954,528,000,000,000 comparisons.

      Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

      Your rate of comparisons is 21,250,000/second.

      Your runtime to run the 100,000 x 30,000 x 1000 is:

      1 year, 5 months, 4 days, 21 hours, 29 minutes, 24.7 seconds.

      For the 3-miss scenario the numbers are:

      2626 * 100,000 * 976 * 30,000 = 7,688,928,000,000,000.

      Your runtime to run the 100,000 x 30,000 x 1000 is:

      11 years, 5 months, 20 days, 2 hours, 51 minutes, 45.88 seconds.

      For the 4-miss scenario the numbers are:

      28,252 * 100,000 * 976 * 30,000 = 82,721,856,000,000,000.

      Your runtime to run the 100,000 x 30,000 x 1000 is:

      123 years, 4 months, 9 days, 17 hours, 27 minutes, 3.46 seconds.

      As for your challenge: I have published my code. Where is yours? Try applying your method to real-world data.

      If you have somewhere I can post the 1000 x 1000-char randomly generated sequences (994 K) and the 1000 x 25-char randomly generated search strings (27 kb) then I will send them to you so that you can time the process of producing the required information of:

      Sequence no/ offset/ fuzziness (number of mismatched characters) that my published test code produces in 28.5 minutes.

      Then, and only then, when we are comparing eggs with eggs, will there be any point in continuing this discussion.



        I'm not at liberty to post my code. Sorry. All I can say is what I have already said: it is an Inline::C implementation of a Patricia tree optimized for handling a 10-character alphabet.

        Regarding all your bollocks in this reply: first off, I was very clear in my node that I misunderstood the question and that my proposals only handled the issue of finding exact matches of the fuzzy word list, not finding substrings. Since that is what you insulted and rubbished me about, that is what I am talking about.

        I have already stated that I ran my tests against live data (35 million telephone numbers) and that it took 0.00006 seconds per telephone number, or 1 million lookups per minute, or the full 35 million in just over 30 minutes. (This includes the overhead of reading the file and splitting the records by tabs.)

        Now, comparing your algorithm against mine will mean that you only have to test to see if there is a fuzzy match from the first character. So if you want to provide a subroutine that I can run over my test data set, then we are in business. But I'm not going to help you write it, and in fact I'm not going to do any more than run it to see. As it is, I should be working on speeding up the site, not sparring with someone who doesn't treat me with the decency I afford to him.

        ---
        demerphq


        Your math is all wrong.

        Your 30,000,000 * 1275 = 38,250,000,000 comparisons.

        I used an 11-digit key to generate my fuzzy list of 1275 words (I took a phone book and picked a number and generated all the one- and two-digit variants, then searched against the book; the book was just under 35,000,000 numbers). Since the number of comparisons in a lookup into a Trie is the same as the number of characters in the matched key, the absolute maximum number of comparisons used by the Trie-based algorithm is 11 * N, where N is the number of words in the dictionary. Thus the absolute maximum number of comparisons to search a 35 million word dictionary is 385 million character comparisons; in practice it will be significantly less because of early bailouts.

        Oh, another cool thing about the trie approach is that it could do fuzzy matches on a whole host of strings simultaneously for exactly the same work. The OP said he needs to run a large set of these matches. He could generate his fuzzy list for each word, walk the dictionary once, and have matched the full set in one go.

        All of this assumes the thing I got wrong in the first place: that the match is start-point anchored. I wasn't paying attention, and when I realized (when I did my re-submit preview) I added the bit starting "gah". Since I thought the ideas raised were interesting, and the qualifier was present, I didn't think it worth editing further. Maybe next time I'll put in huge blinking letters "OOPS -- I ANSWERED THE WRONG QUESTION" so it's more clear.

        ---
        demerphq
