go ahead... be a heretic | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
The best approach may simply be a straight string scan fail fast approach. You can get this far more optimized than the RE engine could manage. It should certainly be in C for speed. Breaking the dictionary into all 25 char long patterns serves no useful purpose - it simply increases the disk I/O by a factor of 25x. Ideally you would want the entire dictionary in memory anyway so you totally avoid any disk I/O bottlenecks. The type of indexing you may or may not be able to do depends on the data. You can form a useful index based on the fact, that as you note, even an off by 2 match must have at least 7 consecutive matching chars - in fact I beleive it is impossible to have less than 8 as 25-2 = 23 Thus the worst case min lengths possible are 7+8+8=23 so you must have at least 2 8+ char stings that match. If you only have 2 possible values in your strings the probability of randonly finding 8 consecutive chars in a given order is 2^8 or 1/256, If you are talking 4 chars ie CGAT then it is 1/65536. This should dramatically shrink the search space as you only need to scan 17 chars upstream and downstream of this sequence. Each search string can be broken into 18 8 char possibles. The problem with this may be (assuming CGAT alphabet) and 100 search strings you will have nearly 1800 unique index strings giving an average spacing in the search string of 65536/1800 ~ 36 chars. Now we have to search 17+8+17 = 42 chars so in a nutshell you still probably have to scan the whole search string every time. The advantage is limited to having to perform the scan on a limited subset of the find strings. There is of course a price for the hashing and lookup. If the alphabet (A) is larger than 4 chars then you should see very significant benefits due the to relative scarcity of A^8 substrings. I would quite possibly be worth precomputing the hash values for all the 8 char strings in your find an search spaces to speed your many to many search. By way of the worst case scenario here is a brute force fail fast
cheers tachyon In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon
|
|