http://qs321.pair.com?node_id=240652


in reply to Inefficient regex

This:
(.*?)\s*(.*?)\s*
can be problematic. Since LABEL and PARAMETER must not contain spaces, try replacing that portion with something like:
([^\s]+)\s+([^\s]+)?

I'd also suggest that you change *? to simply *, since * denotes 0 or more occurances.

Probably the biggest problem is the:
(.+?)
which occurs just before:
<%end_\1%>
and matches to the end of the string, which then causes the regex engine to backtrack each time it cannot find each "end" tag.

You may also want to take a look at the book "Mastering Regular Expressions" by Jeffrey E. Friedl. See Mastering Regular Expressions for a review.

Replies are listed 'Best First'.
Re^2: Inefficient regex (death to dot star)
by tye (Sage) on Mar 05, 2003 at 19:45 UTC

    Exactly. In more detail, what the original regex ends up doing goes like this:

    <%start_(.*?)\s*(.*?)\s*%>\n?(.+?)<%end_\1%>\n? $1- -A- $2- -B- $3- C- Quickly matches <%start_ Tries $1 matching "" Tries A matching "" Tries $2 matching "wibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "w" Tries A matching "" Tries $2 matching "ibble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C Tries $1 matching "wi" Tries A matching "" Tries $2 matching "bble XXX" Tries $3 matches to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Is forced to backtrack by C ... Tries $1 matching "wibble" Tries A matching " " Tries $2 matching "XXX" Tries $3 matching to just before "<%end_flub%>" Is forced to backtrack by C Tries $3 matches to just before "<%end_wibble%>" Succeeds finding one match
    which you can see wastes a lot of time.

    That is one reason why Death to dot star! suggests you use character class instead of . whenever possible in a regex.

                    - tye
      This is great and very clear - thank you!! I've taken on board what you said, experimented by making LABEL long then short to prove the point and sure enough execution times vary wildy. I've altered the regex to now use (\S+) to pick up the LABEL and it now completes in sub-second time. Thanks again!!!
Re: Re: Inefficient regex
by Thelonius (Priest) on Mar 05, 2003 at 19:40 UTC
    Note:\S+ is the same as [^\s+]
      Thelonius, your note is not correct:

      \s+ - matches one or more space characters
      [^\s+] - matches one character that is not a space nor a '+'
      [^\s]+ - (what you may have meant) matches one or more non-space characters.