Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Death to Dot Star!

by Ovid (Cardinal)
on Jul 27, 2000 at 11:27 UTC ( [id://24640]=perlmeditation: print w/replies, xml ) Need Help??

For those who have read some of my past posts, I freely confess to hastily posting bad regexes. I thought I was decent with them, but then I started rereading Mastering Regular Expressions. Yikes! I still have a lot to learn.

This "Discussion" is primarily aimed at those new to regexes, so my apologies if I seem pedantic.

I've seen a number of regexes posted to Perlmonks that use a dot star (.*) combination to slurp up characters. While it is sometimes not possible to avoid this, Perlmonks should try to avoid it like the Inquisition.

Dot star is often used as a "catch all" in regular expressions. Often you'll see a regex like the following:

$myvar =~ /"(.*)"/;
The intent of this is to capture whatever is inside of parentheses to $1 (this is called backreferencing). However, this fails if $myvar is something like (yes, I know it's a ridiculous example):
$myvar = qq(Tom said "hi. It's me," and Sally replied "get lost, Ovid + is more my style. Doncha think?"); $myvar =~ /"(.*)"/; print $1;
We might expect the final line to print hi. It's me, but it won't. This is because the star quantifier is greedy. It will attempt the larget match that can possibly satisfy the regular expression. What would be printed is
hi. It's me," and Sally replied "get lost, Ovid is more my style. Do +ncha think?
To solve this, many programmers will add a question mark after the quantifier (.*?) which makes the quantifier match as little as possible. This is called lazy or non-greedy matching. Merely adding that little question mark will cause the code to print the hi. It's me, that we were looking for:
$myvar =~ /"(.*?)"/;
So we're all set, right? Wrong. This certainly works better than the greedy dot-star combination. It keeps trying to match until it finds the first quote mark and then tries to get the smallest match that satisfies the regex. Sounds fine, right? Well, no.

There are two problems with both the greedy and lazy version of the dot star: imprecision and tracking.

The imprecision is obvious. The example above shows the imprecision with the greedy version of .*, but the lazy version is more subtle. What happens if you were trying to extract questions in quotes without the trailing question mark? You might think that something like /"(.*?)\?"/ would do the trick. Unfortunately, we get the same result as above because the lazy matching doesn't guarantee the smallest match. It guarantees the smallest match from the first place that a match could possibly begin.

Tracking is another problem with both of them. With the greedy version of dot star, the dot star gobbles up the entire string. The regex engine is then forced to backtrack from the end of the string to find the longest possible match that will satisfy the regex. With the lazy version, the the regex engine is forced to stop after every match to see if the rest of the regex will match (kind of like a lookahead, but with subtle differences. See Mastering Regular Expressions, Second Edition, page 226 for the mechanics of this). With a one-shot regex, these issues are not usually much of a performance hit (though it can be nasty on a complicated regex with no possible match), but if you are forced to iterate over the regex, these "tracking" issues can have quite a performance hit on your program.

The solution is to use a negated character class:

$myvar =~ /"([^"]*)"/;
What's going on there? When a caret "^" is the first character in a character class, it's telling the regex to match anything except what is in the character class. In this case, it is telling it to keep matching anything (including newline) that is not a quote. If there is a possible match, there is no tracking, there's no ambiguity, there's just a straight match. The above regex will match the first quote, capture everything that's not a quote to $1, and then match the end quote. It's fast and precise.

Unfortunately, it's a little more complicated with the "questions in quotes" example. Here's one solution:

$myvar =~ /"((?:[^?"]|\?[^"])*)\?"/;
Which breaks out to:
$myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses [^?"] # Anything that's not a question mark or quote | # or \?[^"] # A question mark not followed by a quote (to a +llow embedded question marks) )* # Zero or more ) # End capture \?"/x; # Followed by a question mark and quote
Yes, it's more work. Yes, it's harder to read. But it works and doesn't have the problems of the dot star.

A potential pitfall: Unless you are using the /s modifier, the .* does not match a newline (\n). A negated character class will happily match a newline if you don't include it. If your target text has them, be sure to include them in the negated character class, if appropriate.

If Monks would like to see more stuff like this, let me know.

Replies are listed 'Best First'.
RE: Death to Dot Star!
by merlyn (Sage) on Jul 27, 2000 at 16:45 UTC
    Aha! One of the classic mistakes was made on this code:
    $myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses [^?"] # Anything that's not a question mark or quote | # or \?[^"] # A question mark not followed by a quote (to a +llow embedded question marks) )* # Zero or more ) # End capture \?"/x; # Followed by a question mark and quote
    Try this with
    $myvar = q{ abc"def??"ghi?"jkl };
    And you'll see that it matches the ghi, not def??. The problem is that the "question mark NOT followed by a quote" can sometimes eat up the question mark that you need to begin your closing delimiter.

    The proper way to tackle this is to "inch-along"...

    $myvar = q{ abc"def??"ghi?"jkl }; print "matched <$1>" if $myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses (?!\?") # not question quote? . # ok to inch along )* # Zero or more ) # End capture \?"/sx; # Followed by a question mark and quote
    which properly prints:
    matched <def?>

    I was tackling this kind of thing a lot when people would keep posing the "how do I match a C comment?" back in the early days Pre-Ilya-RE. I got pretty good at breaking just about any regex that claimed to match a comment, by undoing any assumption made.

    -- Randal L. Schwartz, Perl hacker

      Okay, the title is kind of a joke. It's just a good-natured tweak at merlyn for the brouhaha over his WARNING t0mas wrote BAD CODE node that generated so much flak. No offense intended :)

      merlyn's code was bugging me, but I couldn't quite put my finger on it. My problem was that the dot metacharacter is so indiscriminating that it will match anything. However, I simply assumed that if merlyn posted the code, it must work. His code is great if you're checking for C-style comments that begin and end in something like /* comment here */ or "? comment here ?". But if you read my post, that's not what we were checking for:

        What happens if you were trying to extract questions in quotes without the trailing question mark?
      I mentioned embedded question marks (my idea was that we might have more than one question in a quote), but I never mentioned embedded quotes. I just wanted one set of quotes and my original post bears that out. Here's merlyn's code and my correction:
      #!/usr/bin/perl -w $myvar = q{ abc"def"g"hi?"jkl }; # This regex is from merlyn print "matched <$1>\n" if $myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses (?!\?") # not question quote? . # ok to inch along )* # Zero or more ) # End capture \?"/sx; # Followed by a question mark and quote # This regex is from Ovid print "matched <$1>\n" if $myvar =~ /" # First quote ( # Capture text to $1 (?: # Non-backreferencing parentheses [^?"] # Not a question mark or parentheses | # or \?(?!") # A question mark not followed by a quote )* # Zero or more ) # End Capture \?"/sx; # Followed by a question mark and quote
      The first regex will print matched <def"g"hi>. The second will print matched <hi>.

      No disrespect is intended towards Randal as he was right in pointing out that my first regex was broken.

      Cheers,
      Ovid

        A reply falls below the community's threshold of quality. You may see it by logging in.
      Drat, drat drat! And I was on a roll :) Nice work.

      Rather than simply testing for a question mark followed by a character that is not a quote (\?[^"]), I should have tested for a question mark with a negative look-ahead (\?(?!")) for a quote. This appears to work:

      $myvar =~ /"((?:[^?"]|\?(?!"))*)\?"/';
      Unfortunately, Benchmark shows that it's not quite as fast as merlyn's version.

      For those unfamiliar with lookaheads, they allow you to test for text without "bumping along" the regex. In other words, \?[^"] will check for a question mark followed by a non-quote character, but further matching of the regex continues after the non-quote character. \?(?!") allows you to check for a question mark not followed by a quote, but continues matching after the question mark.

      Note: There is a subtle difference between the negated character class and the negative lookahead. The negated character class generally requires a character after the question mark (in the above example), while the negative lookahead just makes sure that a quote doesn't follow the question mark and doesn't require a character.

      Cheers,
      Ovid

Buzzcutbuddha: RE: Death to Dot Star!
by buzzcutbuddha (Chaplain) on Jul 27, 2000 at 16:10 UTC
    Great commentary.

    Before I purchased the Owl book I was guilty of using the dot/star notation too. oops. Kudos to you for bringing it up here.

    Toadi yes, you can often get this information from Perlman, and the Perldocs, but to see it here will make more people think about it twice before they use it.
RE: Death to Dot Star!
by agoth (Chaplain) on Jul 27, 2000 at 13:57 UTC
    Definitely like to see more,
    having got away with (.*?) up to now, any thoughts, brief eg's are generally useful and worth reading in a spare moment,
    cheers
Re: Death to Dot Star!
by Juerd (Abbot) on Jan 01, 2002 at 06:07 UTC
    If .* is being used to grab data, or in a regex that's just there to verify if something's correct, it is bad most of the time.

    BUT... If s/// is used, .* often is the right choice.

    s!^.*/!!


    Please don't use [\s\S]* or [\d\D]* etc because someone told you .* is bad. You can use .* if you know it's right. (btw, . doesn't match \n unless /s is used, [\d\D] and friends do match \n).

    Summary: if you know what it does, and it does what you want, .* is not bad.

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

Re: Death to Dot Star!
by Anonymous Monk on Apr 03, 2001 at 06:32 UTC
    Why not just $myvar =~ /"([^"]*)\?"/
      /"([^"]*)\?"/ almost works, but often, quoted strings allow escapes:
      my $myvar = '"Did John say \"go home\" to Suzie?"'; $myvar =~ /"([^"]*)\?"/;
      The above code will put " to Suzie" (without the quotes) in $1.

      Cheers,
      Ovid

      Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.

        So does yours. If you want to allow \" inside strings, you need to use something like /"((?:\\"|[^"])*)\?"/
Re: Death to Dot Star!
by Anonymous Monk on Feb 27, 2002 at 03:21 UTC
    The last task, to match a question, can be solved quite easily by reversing the string.
    my ($m) = reverse($myvar) =~ /"\?(.*?)"/s; $m = reverse $m; print "matched <$m>\n";
    Note that it's OK to use .*? here. And due to optimizations in the regex engine it's even faster to use it than a negative char class, if the following pattern is constant as it is in this case.

    It's also very easy to make it work for escaped quotes by adding a negative look-ahead.     my ($m) = reverse($myvar) =~ /"\?(.*?)"(?!\\)/s; Note that here you must use .*? instead of a negative char class since you want the engine to backtrack.

    -Anomo
      you seem to have missed the point and all the nodes in this thread, not good
        I posted this as a tip to the readers of this thread. The audience is people who aren't too fluent with regexes, and the regex for matching the question is perhaps tough for beginners. I merely wanted to point out that you don't always have to use that cumbersome and easy-to-slip-with solution. And yes, I understand that it was for educational purposes.

        The task was to get rid of .*?'s in this case unwanted behaviour. That is what my solution does. I don't see why this solution doesn't fit into the thread as nicely as Ovid's regex.

        I was a bit sparse with words in my previous post, but I had no intention to devalue Ovid's node. I think it's great. I just wanted to share an idea that was applyable to the problem Ovid was solving.

        -Anomo
        The following article is about using reverse regular expressions, or sexgers (snicker). Its purpose it to increase efficency by avoiding the lookahead mechanism. http://www.perl.com/pub/a/2001/05/01/expressions.html Todd W.
RE: Death to Dot Star!
by toadi (Chaplain) on Jul 27, 2000 at 13:05 UTC
    Nice,

    But most of these things kan be read on the perlman or a good regex book.

    It's good when you see a mistake and correct them, but these kind of things can be better bundled in a tutorial of some kind.

    But good on you to correct mistakes made earlier.

    I sometimes use some solutions and help others with sharing my solutions. And sometimes I realize there's a better way to do it and my way isn't that good way to do it.

    But hey I'm learning. I program perl just as long as I joined perlmonks :-)
    But that's tha advantage of PM I give my solution to a problem on PM and sometimes I get a Reply that corrects or improves my solution.

    --
    My opinions may have changed,
    but not the fact that I am right

Re: Death to Dot Star!
by aquarium (Curate) on Apr 03, 2003 at 11:54 UTC
    I bet your friend still hasn't learnt regexes. I think It's often the case when they (friends) mention they'd like to learn regexes, it realy means that they'd like a magic wand to wave over them so they become regex wizards. btw the greedy behavior of dot star is well known, so if it's taken into consideration there's usually no problem. dot star is best used in conjunction with unique delimiters on either/both sides of it. Chris
      You obviously have not understood the (sometimes catastrophic) performance issues that backtracking can cause. True story. I once saw a person who should have split on the comma instead try to use /(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)/ to pull his data out. Even though this was guaranteed to work on his dataset, he thought that Perl was slow because this performed so pathetically.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://24640]
Approved by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (7)
As of 2024-03-28 13:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found