Why tries won't work for this application.

For 25-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's:

  1. 0-miss (exact)

    = 1

  2. 1-miss

    (CcM) * (A-1)^M = 25c1 * (4-1)^1 = 25 * 3 = 75.

  3. 2-miss

    (CcM) * (A-1)^M = 25c2 * (4-1)^2 = 300 * 9 = 2700.

Total 2776 keys. No problem building or with the performance of a Trie with 2776 keys.
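
If you want to sanity-check that arithmetic, here's a throwaway sketch--the choose() helper is mine, not from any module--which confirms both this total and the 4-char example used below:

    use strict; use warnings;

    sub choose {                        # binomial coefficient: n choose k
        my( $n, $k ) = @_;
        my $r = 1;
        $r = $r * ( $n - $_ + 1 ) / $_ for 1 .. $k;
        return $r;
    }

    for my $case ( [ 25, 4 ], [ 4, 4 ] ) {
        my( $C, $A ) = @$case;
        my $total = 0;
        $total += choose( $C, $_ ) * ( $A - 1 ) ** $_ for 0 .. 2;
        print "$C-char keys over $A-char alphabet, 0-2 misses: $total\n";
    }
    ## 25-char keys over 4-char alphabet, 0-2 misses: 2776
    ## 4-char keys over 4-char alphabet, 0-2 misses: 67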

Although the implementations currently on CPAN chew memory at a prodigious rate--a C implementation would be much less memory hungry, as well as faster.

The problems.

To simplify the description of the problems, I'll start by using a 4-character key from the alphabet ACGT.

Further, we'll take the OP's description and look for matches with 0, 1 or 2 mismatched characters.

For 4-char (C) keys from a 4-char alphabet (A) with 0, 1 or 2 mismatches (M), that's:

  1. 0-miss (exact)

    = 1

  2. 1-miss

    (CcM) * (A-1)^M = 4c1 * (4-1)^1 = 4 * 3 = 12.

  3. 2-miss

    (CcM) * (A-1)^M = 4c2 * (4-1)^2 = 6 * 9 = 54.

Total 67 keys.

Picking one of the 256 possible keys, I'll use 'ACGT'.

The keys we need to put into our trie in order to fuzzy match 'ACGT' are:

## 0-miss (exact) match. 1 unique
ACGT

## 1-miss match: 4*4 = 16 - 4 duplicates of above = 12 unique
## With tags (°=exact; š=1-miss match; ˛=2-miss match) showing where
## a key resulted from multiple sources.
v        v        v        v    ## The vs indicate the columns varying.
-----------------------------
ACGT°˛  AAGT˛   ACAT˛   ACGA˛
CCGT˛   ACGT°˛  ACCT˛   ACGC˛
GCGT˛   AGGT˛   ACGT°˛  ACGG˛
TCGT˛   ATGT˛   ACTT˛   ACGT°˛

## 2-miss match: 6*4*4 = 96 -
##   ( 6 duplicates of 0-miss + ( 3 each duplicates * 12 1-miss ) )
## With tags (°=exact; š=1-miss match; ˛=2-miss match) showing where
## a key resulted from multiple sources.
## = 54 unique.
vv      v v     v  v     vv      v v      vv
--------------------------------------------
AAGTš   ACATš   ACGAš   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA    ACATš˛  ACGAš˛  ACCA
GAGT    GCAT    GCGA    AGAT    AGGA    ACGAš˛
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCTš   ACGCš   AACT    AAGC    ACAC
CCGTš   CCCT    CCGC    ACCTš˛  ACGCš˛  ACCC
GCGTš   GCCT    GCGC    AGCT    AGGC    ACGCš˛
TCGTš   TCCT    TCGC    ATCT    ATGC    ACTC
AGGTš   ACGT°˛  ACGGš   AAGTš˛  AAGG    ACAG
CGGT    CCGTš˛  CCGG    ACGT°˛  ACGGš˛  ACCG
GGGT    GCGTš˛  GCGG    AGGTš˛  AGGG    ACGGš˛
TGGT    TCGTš˛  TCGG    ATGTš   ATGG    ACTG
ATGTš˛  ACTTš   ACGT°˛  AATT    AAGTš˛  ACATš˛
CTGT    CCTT    CCGTš˛  ACTTš˛  ACGT°˛  ACCTš˛
GTGT    GCTT    GCGTš˛  AGTT    AGGTš˛  ACGT°˛
TTGT    TCTT    TCGTš˛  ATTT    ATGTš˛  ACTTš˛

## Giving us our 67 unique keys.

Now it's no problem to use a more intelligent generation routine to avoid generating most of the duplicates.

Indeed, noticing that the brute-force, 2-miss generation process produces all of the keys required by the 0- and 1-miss cases, it would be a simple matter to skip those generation steps and simply de-dup the 2-miss set.

## 2-miss match generation with duplicates removed
## With tags (°=exact, š=1-miss match) showing where
## a key resulted from multiple sources.
vv      v v     v  v     vv      v v      vv
AAGTš   ACATš   ACGAš   AAAT    AAGA    ACAA
CAGT    CCAT    CCGA                    ACCA
GAGT    GCAT    GCGA    AGAT    AGGA
TAGT    TCAT    TCGA    ATAT    ATGA    ACTA
ACGT°   ACCTš   ACGCš   AACT    AAGC    ACAC
CCGTš   CCCT    CCGC                    ACCC
GCGTš   GCCT    GCGC    AGCT    AGGC
TCGTš   TCCT    TCGC    ATCT    ATGC    ACTC
AGGTš           ACGGš           AAGG    ACAG
CGGT            CCGG                    ACCG
GGGT            GCGG            AGGG
TGGT            TCGG    ATGTš   ATGG    ACTG
        ACTTš           AATT
CTGT    CCTT
GTGT    GCTT            AGTT
TTGT    TCTT            ATTT

In fact, provided your Trie implementation doesn't die or bellyache when you attempt to add a duplicate key, it will automatically perform the de-duping process for you.
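
To illustrate, here's a minimal sketch of that generate-and-de-dup step, using a plain hash to stand in for the duplicate-tolerant Trie (the names and structure are mine, purely illustrative):

    use strict; use warnings;

    my $key   = 'ACGT';
    my @alpha = qw( A C G T );
    my %keys;                                          # the hash de-dups for free

    for my $i ( 0 .. length( $key ) - 2 ) {            # 1st varying position
        for my $j ( $i + 1 .. length( $key ) - 1 ) {   # 2nd varying position
            for my $x ( @alpha ) {
                for my $y ( @alpha ) {
                    my $fuzzy = $key;
                    substr( $fuzzy, $i, 1 ) = $x;
                    substr( $fuzzy, $j, 1 ) = $y;
                    $keys{ $fuzzy }++;                 # 96 inserts...
                }
            }
        }
    }
    print scalar( keys %keys ), "\n";    ## ...67 unique keys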

And there's the rub

Once you start storing multiple sets of overlapping data in your Trie in order to make best use of its performance, when you get a match you can no longer distinguish whether that match occurred because it was an exact match, a 1-miss match or a 2-miss match.

So, even if you generate 1 Trie per key and apply it to each of the 976 25-char substrings in a 1000-char sequence 1 at a time (which is necessary when using a Trie, and has the beneficial side-effect of letting you know at what position within the sequence the match occurred--which is a requirement), you still don't know why it matched.

Even if you arrange for the value associated with the key to be an array of tags associated with the sources of the key, that will only tell you the possible sources of the match, not the source.
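
For example (a sketch; assume 'ACGT' and 'AAGT' are both exact keys, so their fuzzy sets overlap):

    my %value;                                   # key => list of source tags
    push @{ $value{ 'AAGT' } }, 'ACGT:1-miss';   # AAGT is 1 miss from ACGT...
    push @{ $value{ 'AAGT' } }, 'AAGT:exact';    # ...and an exact key itself

    print "@{ $value{ 'AAGT' } }\n";   ## ACGT:1-miss AAGT:exact
    ## A hit on 'AAGT' yields both tags; which source actually caused
    ## the match is unrecoverable.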

So, whilst Tries are very good at doing fast lookups, and can be used for doing fuzzy lookups, the problem is that, like regex, they don't tell you what they matched.

With a regex, you can use the capture mechanism to discover this. Whilst this will slow the process somewhat, it is more than compensated for by the fact that the regex engine can be allowed to run through the entire string, locate every match and record where it found them (@- & @+), so the number of calls to the regex engine is one per sequence per fuzzy match.
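
A sketch of that usage (the three alternatives here are a tiny stand-in for a full 67-key fuzzy set):

    use strict; use warnings;

    my $seq   = 'TTACGAGGACGT';
    my $fuzzy = join '|', qw( ACGT ACGA ACGC );   # stand-in alternation

    while( $seq =~ m[($fuzzy)]g ) {    # one engine call scans the lot
        print "matched '$1' at offset $-[0]\n";
    }
    ## matched 'ACGA' at offset 2
    ## matched 'ACGT' at offset 8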

MegaTrie

Now it is possible to combine the keys generated from more than one exact key into a single Trie. At least in theory, this would allow the 100,000 exact keys plus all their fuzzy matches to be combined into one MegaTrie. This approximates a DFA, by solving all 100,000 * 2700 (2-miss) "Does it match?" questions in one go, for each 25-char substring, reducing the problem to 976 * 30,000 = 29,280,000 lookups (here defined as a transition from Perl into C and back). When compared with the 954,528,000,000,000 otherwise required by the 2-miss scenario, this sounds very inviting.

The problem is that all you then know is that one of the 100,000 exact matches, or one of their 270,000,000 possible fuzzy matches, did or did not match each of the 29,280,000 substrings. So, you get to know something (that the 25-char substring at position P of sequence S matched something)--very quickly--but not what (which of the 100,000), nor why (exact, 1-miss or 2-miss).
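
A sketch of the lookup that makes the point (a hash-of-hashes Trie, with 4-char keys standing in for the 25-char megaset):

    use strict; use warnings;

    my %trie;
    for my $key ( qw( ACGT ACGA AAGT ) ) {    # stand-ins for millions of keys
        my $node = \%trie;
        $node = $node->{ $_ } ||= {} for split '', $key;
        $node->{ END } = 1;
    }

    sub matched {                             # walk the Trie: boolean only
        my $node = \%trie;
        for ( split '', $_[0] ) {
            $node = $node->{ $_ } or return 0;
        }
        return $node->{ END } ? 1 : 0;
    }

    print matched( 'ACGA' ), "\n";   ## 1 -- but not which key, nor why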

Fast pre-elimination?

At this point, I thought that maybe the MegaTrie idea could be used as a way of pre-eliminating some of the nearly 30 million substrings, prior to using (say) the XOR method to derive the required information. However, because the (Mega)Trie needs to be applied individually to each substring--unlike a regex, for example--and because a Trie lookup involves either recursive or iterative sub-calls, it is slower than just using the XOR alone.

That means that there is no saving to be had using a Trie--even for this limited purpose.
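
For reference, the essence of the XOR method mentioned above (the two strings here are made-up examples):

    use strict; use warnings;

    # String-XOR of two equal-length strings gives "\0" wherever they
    # agree; counting the non-NUL bytes counts the mismatches.
    my $key = 'ACGTACGTACGTACGTACGTACGTA';          # 25 chars
    my $sub = 'ACCTACGTACGTACGAACGTACGTA';          # differs in 2 places

    my $mismatches = ( $key ^ $sub ) =~ tr/\0//c;
    print "$mismatches\n";    ## 2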

Deterministic Finite-state Automaton

The ultimate expression of the Trie idea would be to write a DFA. Essentially this consists of a MegaTrie-like data structure and a loop to apply it successively down the length of a string, substring-by-substring, reporting the position(s) of the match(es) found. Much like the regex engine does, but failing early and moving on rather than backtracking*.

*It's worth noting that you can also use the 'cut' operator (?>...) to eliminate backtracking in the regex engine.
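
For instance (a two-line illustration):

    'AAAB' =~ /A+AB/     and print "matched (via backtracking)\n";
    'AAAB' =~ /(?>A+)AB/ or  print "no match: the cut forbids backtracking\n";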

This would reduce the number of lookups (again defined as a transition from Perl into C and back) to 30,000. You still wouldn't know what matched, but the speed-up would offset the cost of further elimination.
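
To give a feel for the shape involved, a toy, array-based sketch (it recognises the single key 'ACGT', and naively restarts instead of using proper failure transitions à la Aho-Corasick; the generator that would build such tables for millions of fuzzy keys is the hard part):

    use strict; use warnings;

    my %code = ( A => 0, C => 1, G => 2, T => 3 );

    # States are array indices: 0 = start, 1..3 = prefix of 'ACGT' seen,
    # 4 = accept. Unlisted transitions fall back towards the start state.
    my @dfa;
    $dfa[0][ $code{A} ] = 1;
    $dfa[1][ $code{C} ] = 2;
    $dfa[2][ $code{G} ] = 3;
    $dfa[3][ $code{T} ] = 4;

    my $seq   = 'TTACGTGG';
    my $state = 0;
    for my $pos ( 0 .. length( $seq ) - 1 ) {
        my $c = $code{ substr $seq, $pos, 1 };
        $state = $dfa[ $state ][ $c ] // $dfa[0][ $c ] // 0;
        print "match ending at offset $pos\n" if $state == 4;
    }
    ## match ending at offset 5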

The problems include:

  • All the DFA modules I looked at on CPAN used hash-based objects to represent the DFA states, and as such would require huge amounts of memory to capture the complexity of this problem.
  • Writing a C-based DFA is non-trivial, even for a few, known states.

    But this problem involves a huge number of states, and those states are going to vary from run to run. So the task is not writing a single DFA, but that of writing a DFA generator. This is an order of magnitude more complex to do.

  • The sheer volume of states involved means that it would be no good writing a poor implementation of a DFA and relying upon the natural speed of C to see you through.

    Not only would it need to be a fairly efficiently coded DFA, it would also need to be highly optimised in its use of memory (notoriously difficult) in order that the huge number of states would fit into memory concurrently.

  • It would take a very highly skilled (C) programmer to write a DFA for this task that would out-perform the Perl regex engine and some suitably carefully crafted regex (using the cut operator).

    The regex engine has had a lot of very skilled programmers working on it for a very long time. If the OP is up to this task, he is in the wrong job as a biochemist (Assumption!).

    The gulf between what is theoretically possible and what is practical is very wide.

The point of all this?

Firstly, I'd done all the above work in order to verify my position anyway, and I thought someone might benefit from it.

Secondly, the next time someone calls into question something you have posted,

  1. Don't get so upset by the form of the message that you ignore the content.

    This is especially true of a private message between old sparring partners.

    Even if it says "Utter bollocks", which is not rude, but a colloquialism for "Utter rubbish" or "Completely in error".

  2. Don't assume that you must be right and the other guy must be wrong.

    You might be in error.

    If it was worth posting, it is always worth a second (cool) assessment before you defend your post to the hilt.

  3. Don't conflate the matter to which the message relates, with other, older issues, long dead.
  4. Don't assume that because you have used the methods in question successfully on some other project, that the same methods will be applicable to the problem at hand.

    Even if they sound remarkably similar at first reading.

  5. Don't assume that your understanding of the CS issues involved is superior to the other guy's understanding, no matter what history between you dictates.

    He may well have learnt from prior encounters and/or further study in the interim.

  6. Don't expect the other guy to accept your statistics as proof.

    Especially when those statistics are produced using code you cannot show, running on data that is completely unrelated to the problem at hand.

  7. Don't expect the other guy to produce code to validate your argument.

    Especially when he has already published analysis, code, and verifiable statistics.

    Especially when the code you are asking him to produce would have to run against a system that you cannot show him.

    Especially when that system uses data that is completely unrelated to the problem under discussion.

    Especially when the information that system produces is completely inadequate for solving the problem at hand.

  8. Don't think that you can never make a mistake.

    There is nothing wrong with being human.

    But, if the result of your mistake could lead others into spending considerable time and effort on fruitless explorations, do make sure that you clearly flag the mistake when you realise it.

FWIW, most of these have been applicable to me in the past. I'm no saint (in the other sense of the word) as you well know.

Oh! And if you ask for clarifications of the problem, do give the other guy the chance to produce them before:

  • Raising the ante (by going public);
  • Engaging him in extended bouts of discussion that would be unnecessary, or simplified, given that clarification;
  • Firing back several, extended batches of dismissals, prevarications and "fuck off you dickhead"s at him.

Thirdly, maybe a view through the "other guy's eyes" of yesterday's conflagration will persuade you to review your final act in our off-post communications?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^3: Fuzzy Searching:Alg.Sel. (Why tries won't work for this application.) by BrowserUk
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki
