Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

The "anchor" misnomer in regexes

by japhy (Canon)
on Dec 16, 2005 at 15:00 UTC ( [id://517258]=perlmeditation: print w/replies, xml ) Need Help??

The most common category of regex questions I see is about greediness and non-greediness, and I think many of you will agree: it's a potentially tricky concept for beginners to understand, and it can bite even seasoned regexers from time to time, present company included. A good number of the questions in this category have to do with matching at the end of a string. Here's an example:
n00b
I have the regex /#(.*?)$/ and the string "this #is an #example string". I want to match "example string" with the regex, but for some reason it's matching "is an #example string" even though I've got a non-greedy .* and I've got it anchored to the end of the string.
Invariably, you'll hear answers like "regexes are left-most longest" and "you want [^#]* instead of .*?" and "start your regex with .*". But I don't think I've seen an answer (at least, it hasn't caught my eye like the way I'd expect it to) that says "you aren't anchoring your regex to the end of the string".

Thus, this meditation. We often refer to \A \b \B \G \z \Z ^ $ as "anchors", but I feel this is a misnomer, a simplification of their real name (which would be something like "string location assertion") that gets in the way of their actual purpose. One could say that the regex /f$/ is anchored to the end of the string it matches, and one would be essentially right, since such a simple regex triggers a couple optimizations in the regex engine that result in the regex really meaning "look for an 'f' at the end of the string" rather than "find an 'f' followed by END-OF-STRING". Clearly, in a string with many f's, the optimized version is much smarter and faster.

But the regex /\s+$/ is not anchored (and this grieves me). I've used this example time and time again when explaining regex reversal and that whole tangent, and it's useful again. If the regex were really anchored (that is, immovable), it would find the end of the string, and then match the regex in that context -- that is, the regex would have to terminate at the end of the string. As it stands, Perl can't optimize the regex that way, and we end up matching EVERY chunk of whitespace, and then testing to see if END-OF-STRING comes after it.

So I think "anchor" is incorrect. But "string position assertion", while accurate, is clunky. So where do we go from here? I'm not sure. This is mainly me venting a frustration. Ideally, in Perl 6, we'll have real anchors and regexes that do what we mean.

Hi ho anchor, aweigh!


Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Replies are listed 'Best First'.
Re: The "anchor" misnomer in regexes
by xdg (Monsignor) on Dec 16, 2005 at 15:49 UTC
    We often refer to \A \b \B \G \z \Z ^ $ as "anchors", but I feel this is a misnomer, a simplification of their real name (which would be something like "string location assertion")

    The term used in perlretut is "zero width assertion". Unfortuntely, that document also is chock full of the "anchor" term, going so far as to say that anchors are an example of a zero-width assertions.

    As to your example, I'm not sure that telling the person that's it's not anchored really helps, as that's not the real misunderstanding they're having. They could just as easily get it wrong looking for a character, not an "anchor":

    $string = "this #is an #example string!"; $re = qr/#(.*?)!/;

    They need to understand that their regex is just doing exactly what they told it to do -- match a "#" and then match anything (except a newline) until the end of line is matched. What they said they wanted and what they told the computer are two different things. (What's that MJD quote about "does what you say, not what you mean"?)

    Telling them what they should do instead ("[^#]") isn't really helpful to them in the long run unless it's accompanied by the explanation of what the regex means to the computer and why that's different than what they think it means.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: The "anchor" misnomer in regexes
by Perl Mouse (Chaplain) on Dec 16, 2005 at 15:30 UTC
    You want to call /f$/ something that's anchored, but /\s+$/ can't be called anchored, because of what the optimizer can or cannot?

    I don't think anyone will benefit if you let jargon depend on the current state of the optimizer. That would only be confusing - and it will expose the implementation. The meaning of /\s+$/ doesn't change whether the optimizer catches it or not. And if neither the meaning, nor the syntax changes, the name shouldn't change either.

    IMO, /\s+$/ is anchored, because it will only match if there's whitespace in front of the end of the string - and it will match that whitespace. How the regex engine comes to that conclusion doesn't matter.

    Your example as an example of anchoring being a misleading term is itself misleading. It's exactly doing what it means: it makes sure that what needs to be matched ends at the end of the line. Remove the anchor, and it'll match something else (the empty string). If the problem is in the anchoring, then it's because you don't have an anchor at the beginning of your match. Not that there's an easy anchor for that - never the less, the anchor at the end of the regex is correct. With a correct name. Which also explains why you aren't seeing replies "you aren't anchoring your regex to the end of the string". Because you are.

    Perl --((8:>*
      Remove the anchor, and it'll match something else (the empty string).

      How will \s+ match an empty string? The plus means "match one or more", the \s means "whitespace characters". An empty string has no whitespace characters to match, because it has no characters at all.

Re: The "anchor" misnomer in regexes
by halley (Prior) on Dec 16, 2005 at 17:26 UTC
    Agreed with the above two, that the problem is in what some people think is implied by the term "anchor."

    If you think the anchor is a "starting place" on which the regex algorithm should work, then you will definitely get confused about /\s+$/ and other examples you've shown. The ability of some regex parsers to get a performance enhancement by noticing that the regex ends with an easy-to-find constraint has nothing to do with the syntax of the forward-matching operation.

    Maybe this is a bit whimsical, but I've never thought of an anchor as the starting place for traversing a ship, so maybe this is why I've never been confused about anchors with regards to reverse-processing of regular expressions. Only the crabs and rats would consider an anchor as an entrypoint. The sailors surely think of an anchor as a restraint on the usual forward-moving operation of the ship.

    --
    [ e d @ h a l l e y . c c ]

      Actually, I expect a sailor would think of the anchor as preventing backward motion of the ship, since the anchor hangs off the bow and causes the ship to point into the wind and/or current.
Re: The "anchor" misnomer in regexes
by duff (Parson) on Dec 16, 2005 at 17:29 UTC

    FWIW, I agree with you that "anchor" is a misnomer. It gives the impression that when you say /foo$/ that the $ is a fixed location that the RE engine revolves around (which, as you say, is only the case under certain optimizations).

    Ideally, in Perl 6, we'll have real anchors ...

    I'm not quite sure what you mean, but it sounds like you think we'll have a way to tell the rules engine something like "jump to this point in the input stream and work sideways" or something. I don't think that "anchors" will be any different in perl6 than they are in perl5 except that we'll probably have more of them by default and they'll all be called "zero-width assertions" as is appropriate. It could just be my lack of imagination, but I don't foresee the rules engine having a way to easily switch its directionality for instance; it will still be primarily left-to-right.

      Hmm, well, the engine is required to switch directionality to check lookbehind assertions, and given that, the optimizer would be silly not to turn /\s+$/ into /$<?after \s+>/. (With allowances for differing capture semantics, of course.)
        AFAIK, the current engine has no such requirement. Instead the restriction is placed that says that only fixed-length lookbehind expressions will work. That way the engine can get to the assertion, backs up, tries to match, then proceed based on that answer. No need to match backwards.

        By contrast Java can match backwards, and the capacity shows in its ability to do variable width lookbehind assertions.

        Well, I guess it was my lack of imagination then :-) But does that mean that right-to-left directionality changes will all be encapsulated in lookbehinds?

Re: The "anchor" misnomer in regexes
by demerphq (Chancellor) on Dec 16, 2005 at 18:51 UTC

    But the regex /\s+$/ is not anchored (and this grieves me). I've used this example time and time again when explaining regex reversal and that whole tangent, and it's useful again. If the regex were really anchored (that is, immovable), it would find the end of the string, and then match the regex in that context -- that is, the regex would have to terminate at the end of the string. As it stands, Perl can't optimize the regex that way, and we end up matching EVERY chunk of whitespace, and then testing to see if END-OF-STRING comes after it.

    I think its important to remember this point only applies to \z (and maybe, I think, \Z) and where $ is the same as \z.

    The logic makes sense when $ is actually end of line and not end of string, as it means doing a scan through the string to find the newline, which means it might as well keep track of state as it goes so it knows where the spaces start.

    I guess what is a pity is that there isn't an optimization that handles this as a special case. It seems to me that patterns of this type could be special cased somehow. (And I mean this in the general case of /CLASS QUANTIFIER EOS/ and not just /\s+$/)

    ---
    $world=~s/war/peace/g

      I’m not sure you have the semantics all correct: \z is always “end of string” and nothing else. \Z and $ are identical by default and mean “end of string, but before a newline if that’s the last character.” However, if you use the /m regex modifier, then $ (but not \Z) changes to line-based semantics and essentially says “before the next newline or at the end of the string (whichever is first).”

      Just to straighten things for anyone for whom they were unclear.

      (The clarification does, of course, not affect what you said about keeping state for $.)

      Makeshifts last the longest.

        Additional clarification, just to muddle things a little:
        However, if you use the /m regex modifier, then $ (but not \Z) changes to line-based semantics and essentially says “before the next newline or at the end of the string (whichever is first).”
        The choice isn't next newline or end of string and isn't whichever is first; regular leftmost/longest backtracking rules apply. So this:
        ("\n" x 10) =~ /(\s+?$)\s{3}\z/m; print length $1;
        gives 7, with the $ matching before the third-to-last newline. Same results without the ?.
Re: The "anchor" misnomer in regexes
by eric256 (Parson) on Dec 16, 2005 at 20:31 UTC

    How does your idea of anchors hold when there are more than one? It is easy to see that $string =~ /#(.*?)$/ could be processed as reverse $string =~ /^(.*?)#/ and it would then produce very different and perhaps more intuitive results. But what would you then do with $string =~ /^#(.*?)$/ .. now your anchor would seem to mean different things and behave differently based on the number of anchors. The current anchors hold well enough and arn't so confusing once you get used to it. After all, find a #, take everything after that up to the end of the line, is pretty straight forward. More so than, take your string, start at the end, and find the first # sign. Which you'll notice is the reverse order of the actual regex so you would have to read it backwards.

    Just my two cents.


    ___________
    Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
      It is easy to see that $string =~ /#(.*?)$/ could be processed as reverse $string =~ /^(.*?)#/ and it would then produce very different and perhaps more intuitive results.

      Those aren't identical.

      $string = "abc#def#ghi"; ($x) = $string =~ /#(.*?)$/; print "$x\n"; ($x) = reverse($string) =~ /^(.*?)#/; $x = reverse $x; print "$x\n"; ---- def#ghi ghi
      You're still thinking of $ as a starting point versus simply being a zero-width assertion. /#([^#]*)$/ is the one that reverses to /^(.*?)#/. The original regex /#(.*?)$/ is more clearly written as /#(.*)$/. Because you don't have any non-zero-width assertions between the .*? and the $, the non-greedy modifier doesn't actually change any behavior. The reverse of the original is /^(.*)#/ - explicitly greedy.

      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

        I didn't say they were identical, and i can't "still" be thinking of them in any way because its my first post. Reading "How does your idea of anchors hold when there are more than one?" I meant to say that the rest of the post would be regarding how the OP was thinking anchors WOULD be, not how they are. My intention was to show that while that view might make sense with the simple single ended cases, more complex cases make it apparent that the current meaning of anchor is realy the one of least surprise and easiest implementation.


        ___________
        Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;
Re: The "anchor" misnomer in regexes
by itub (Priest) on Dec 17, 2005 at 05:01 UTC
    If the regex were really anchored (that is, immovable), it would find the end of the string, and then match the regex in that context -- that is, the regex would have to terminate at the end of the string. As it stands, Perl can't optimize the regex that way, and we end up matching EVERY chunk of whitespace, and then testing to see if END-OF-STRING comes after it.
    IMO, you shouldn't care about the internals of the regex engine, but about the meaning of the regex you are writing. The exact algorithm and optimizations that the engine uses may change from one version to the next, but the meaning of /#(.*?)$/ will always be the same (at least for Perl 5 ;-)
Re: The "anchor" misnomer in regexes
by CountZero (Bishop) on Dec 17, 2005 at 11:59 UTC
    The misunderstanding in your example comes from the non-greedynes of your pattern.

    Since you did anchor your pattern to the end of the string, there is no need to make .* non greedy. To succeed the pattern-matcher must find a pattern of characters that extends all the way to the end of the string. Adding a ? only confuses the issue.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

"Minimal" matching isn't....
by Anonymous Monk on Dec 19, 2005 at 17:40 UTC
    The matching isn't minimal; there's an obvious smaller match which satisfies the criteria of "hash mark", "as little as possible", "end of string", which perl does not find for this regexp.

    This doesn't satisfy the meaning of "minimal match" in English, so of course it's confusing.

    If the "left-most-longest" principle must be used here, then the intent of the minimal match has been overpowered by a greedy rule. For minimal matches, the "left-most-longest" should be replaced by "left-most-shortest". If it's not, then "mimimal matching" isn't minimal at all!

    The problem with regular expressions is that they attempt to be a set of rules for what gets matched, without a documented set of rules for how that matching gets done. When implicit rules about how matching gets done (ie. minimal matching isn't) raise their ugly head, of course people get confused; they've been told the details don't matter, when ultimately they really do.

      I think part of the problem is the term "minimal" which gets used. "Greedy" is a good term for what * does normally--grab as much as possible and only give it up if you really have to. But the opposite behavior, the one for *? isn't minimal, it's lazy. Take as little as you possibly can, and move on. If you're backtracked into, then take another character. It's not optimizing for the smallest match, just for the first one it can find given its laziness.

      For minimal matches, the “left-most-longest” should be replaced by “left-most-shortest”

      Which is exactly what the shown match is: the shortest match from the left-most possible position. “Left-most” always takes precedence over all other rules. I guess this doesn’t get emphasised enough.

      (Was that your point?)

      Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://517258]
Approved by Corion
Front-paged by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (3)
As of 2024-04-25 19:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found