Re: Speed of regex on compiled perl under windows

The slowness is actually an effect of the regex. The way your regex is constructed, the regex engine potentially has to do a lot of backtracking to try to find a match.

Here's a similar example that demonstrates the same problem:

    qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("(\w+| )*")/

The (\w+| )* part can match a word like 'quick' in many ways: ('quick'), or ('quic', 'k'), or ('qui', 'ck'), or ('qui', 'c', 'k'), or... Each time the regex engine gets to the newline and fails to match the second quote, it backtracks and tries another way of matching the words. It's the nested quantifiers that get you.
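
To put a number on "many ways", here's a rough sketch that counts how many ways a run of n word characters can be carved into consecutive \w+ chunks. The helper count_splits is my own name for illustration, not anything built into perl. The count works out to 2**(n-1), so 4 ways for 'fox' and 16 for 'quick'. Multiply that across the nine words of the sentence and you get 2**26, roughly 67 million ways to match the whole thing, and an engine with no optimization for this case may end up trying about that many before giving up.

```perl
use strict;
use warnings;

# Count the ways (\w+| )* can split a run of $n word characters into
# consecutive \w+ chunks. The first chunk may take 1 .. $n characters,
# and whatever is left over is split recursively.
sub count_splits {
    my ($n) = @_;
    return 1 if $n == 0;    # nothing left: exactly one (empty) split
    my $ways = 0;
    $ways += count_splits($n - $_) for 1 .. $n;
    return $ways;
}

for my $word (qw(fox quick)) {
    printf "%-5s can be split %d ways\n", $word, count_splits(length $word);
}
```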

The solution is to restructure the regex so that it can only match a part of the string in a limited number of ways, to eliminate all the useless backtracking. (Very easy in this case, since the regex is so simple.)

    qq{"The quick brown fox jumps over the lazy dog\n"} =~ /("[\w ]*")/
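
If you want to convince yourself that the rewrite doesn't change what gets matched, a quick check along these lines should do. This is only a sketch: the sample strings and the same_match helper are made up for illustration, and the inputs are kept short on purpose so that even a perl that backtracks exponentially gets through the failing cases quickly.

```perl
use strict;
use warnings;

my $nested = qr/("(\w+| )*")/;    # nested quantifiers
my $flat   = qr/("[\w ]*")/;      # single character class

# Return 1 if both regexes agree on the string: either both fail,
# or both capture the same text in $1.
sub same_match {
    my ($s) = @_;
    my ($m1) = $s =~ $nested;
    my ($m2) = $s =~ $flat;
    return 1 if !defined $m1 && !defined $m2;
    return 0 if !defined $m1 || !defined $m2;
    return $m1 eq $m2 ? 1 : 0;
}

# Hypothetical test inputs: some match, some don't.
my @samples = (
    qq{"The quick brown fox"},
    qq{say "lazy dog" now},
    qq{"fox jumps\n"},
    qq{""},
    qq{no quotes here},
);

for my $s (@samples) {
    (my $shown = $s) =~ s/\n/\\n/g;    # make the newline visible
    printf "%-24s %s\n", $shown, same_match($s) ? 'agree' : 'DIFFER';
}
```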

This is what you did when you moved the space inside the character class and removed the nested quantifiers. Here's one way to fix your regex, without changing the semantics:

    (?:\w[\.\w\-\'\!\(\)\/]* +)*\w[\.\w\-\'\!\(\)\/]*

Each iteration of (?:\w[\.\w\-\'\!\(\)\/]* +)* has to match at least one word character, followed by at least one space. Since a word can't contain a space and has to start with a word character, there's only one way for this regex to divide a given stretch of text into words.
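
Here's a small sketch of that regex in use, built from a reusable "word" piece so the repetition is easier to see. The \A and \z anchors, the is_phrase helper and the sample strings are my additions for illustration; how you anchor it in your real code will depend on what surrounds it.

```perl
use strict;
use warnings;

# One "word": a word character followed by any number of word
# characters or the listed punctuation.
my $word = qr/\w[\.\w\-\'\!\(\)\/]*/;

# Zero or more "word, then spaces" groups, followed by a final word.
my $phrase = qr/(?:$word +)*$word/;

# Return 1 if the whole string is a space-separated run of words.
sub is_phrase {
    my ($s) = @_;
    return $s =~ /\A$phrase\z/ ? 1 : 0;
}

# Hypothetical inputs: the first two fit the pattern, the last two
# don't (a comma is not in the class; a trailing space has no word
# after it).
for my $s ("don't stop me/now!", "hello  world", "hello, world", "hello world ") {
    printf "%-22s %s\n", "'$s'", is_phrase($s) ? 'ok' : 'no match';
}
```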

As perl's regex engine has been improved, various optimizations have been added to avoid this exponential backtracking problem. That's probably why your code ran so much faster on Unix; I expect you were using 5.6.0 or 5.6.1 there. My simple example shows the same behavior, returning immediately in 5.6.1 and taking a loooong time to finish in 5.005_03.
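
If you'd like to compare perls yourself, something like this rough timing script should do. It's a sketch, not a proper benchmark: time_match is my own helper, it measures CPU time with the built-in times() so it doesn't need any extra modules, and the numbers will depend entirely on your perl version and machine. On a perl without the optimization, expect the first pattern to take a while.

```perl
use strict;    # 'use warnings' is left out because 5.005 doesn't have it

# Time a single match attempt in CPU seconds using the built-in times().
sub time_match {
    my ($string, $regex) = @_;
    my $start   = (times)[0];
    my $matched = $string =~ $regex ? 1 : 0;
    my $elapsed = (times)[0] - $start;
    return ($matched, $elapsed);
}

my $string = qq{"The quick brown fox jumps over the lazy dog\n"};

my @patterns = (
    [ 'nested quantifiers', qr/("(\w+| )*")/ ],
    [ 'character class',    qr/("[\w ]*")/   ],
);

printf "perl %s\n", $];
for my $p (@patterns) {
    my ($label, $regex) = @$p;
    my ($matched, $elapsed) = time_match($string, $regex);
    printf "%-20s matched=%d  %.2f CPU seconds\n", $label, $matched, $elapsed;
}
```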

Jeffrey Friedl discusses this technique, which he calls "unrolling the loop", in Mastering Regular Expressions.
