http://qs321.pair.com?node_id=571564


in reply to Fuzzy String matching with index?

Inexact string matching is still an active field of research both on the algorithmic and statistical fronts. Some of the more popular algorithms in bioinformatics index the strings and use heuristics to speed up the search. There are also optimizations tied to particular hardware features such as vector units and FPGAs. Most of the programs (eg. BLAST, BLAT, CrossMatch, Exonerate, FASTA, etc.) are written in C with a few exceptions (eg. PatternHunter uses Java).

The concept of an edit distance is useful if you assume that all symbols are equally probable and substitutable, and there are no context dependencies. Biological sequences are not strings, however, and if one is looking for something biological, it often behooves one to use a stochastic grammar rather than a regular one. This is why hidden Markov models have become so popular. In protein space, HMMER is a popular software package (also written in C). Not many people use HMMs in nucleotide space, but they should.

These books might help

"Biological Sequence Analysis: A Probabilistic Approach": Durbin, Eddy, Krogh, and Mitchison, Cambridge University Press
"BLAST": Korf, Yandell, and Bedell, O'Reilly & Associates
"Algorithms on Strings, Trees, and Sequences": Gusfield, Cambridge University Press