Re: A New, Faster Regular Expression Engine? the devil is in the details.

hola all

i have checked the papers by dr Gonnet et al, and want to clear up a number of things re: the algorithms that monsieur champs rightly draws our attention to. i believe these points should attenuate somewhat the enthusiasm for the dream, idealist tho i am.

1) the speedups pertaining to regular expressions achieved by Gonnet et al distinguish static from dynamic text. this is particularly important for us, as the speedups attained are attained because the text is 'preprocessed', or indexed, into a patricia trie, prior to regex matching.

these speedups can be significant if the text you are using is (like a genome) static. if you are dealing with dynamic text, i assume any speedups you might have achieved lost in the frequent reindexing that would be required for the methods of gonnet et al to work. gonnet's example is that of matching against the OED; think to yourself, how many of my regexen are used against this corpus. if the answer is yes, then vote for the new dream by all means. if your answers are more varied, you may prefer to read... point 2:

2) regardless of the time aspect of the problem in 1) the patricia tries used take up significant amounts of memory. i quote from Gonnet: 'for some applications, patricia tries still need too much space'. Do you want to build this unpredictability into perl's robust, pretty fast, hand-tailored regex engine? i would rather not, even if someone else does all the hard work ;-).and also...

3) the paper explicitly states that ' finding an algorithm with logarithmic search time for any RE query is still an open problem'. so was monsiuer champs's friend was wrong? undoubtedly not - for the special text (singular) he considered. the metrics to decide whether *your* text will also do this are... alas, not there, at the mo. to the best of my knowledge. the worst case for the algorithm based on searching a patricia trie (i distinguish this from text) is linear. the algorithm performs less well in the presence of repeats - interesting, given the prevalence of repeats in genomic data. and <gasp>, lastly...

4) i think it's worth distinguishing the above from approximate matching techniques, some of which are, to the best of my knowledge, *not* very fast (esp. the dynamic prog ramming based ones, which give you the best results imho).

nb: i think it's great that the subject above was brought up, it's interesting & thought provoking. i find myself compelled to be conservative with what i know & love, however. match that! ;-)

...wufnik

-- in the world of the mules there are no rules --

Comment on Re: A New, Faster Regular Expression Engine? the devil is in the details.


Keep It Simple, Stupid
	PerlMonks