laziness, impatience, and hubris | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
Hi,
I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:
Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it : Which outputs: Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel). Does anyone know a smarter (faster) way to solve this matching problem ? What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches. Anyways, thank you for any or none input provided :) In reply to Suffix-prefix matching done right by baxy77bax
|
|