http://qs321.pair.com?node_id=212125

Don't ask how I come up with these. Just bear with me. /abc(?#comment)+/ is a source of confusion. What should it be? Does the + quantify (pointlessly) the comment? Or the "c"? Which does Perl do?

Neither. It quantifies the abc as a whole. If you don't think this is a bug, then let me know; otherwise, P5P is going to hear from me. Probably with a patch, too.
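One way to check which reading a given perl actually uses is to probe the pattern with strings that only one interpretation can accept. This is a minimal sketch, not part of the original post: the `classify` helper and its labels are invented for illustration, and the result for the commented pattern is deliberately not asserted, since it may differ between perl versions.

```perl
use strict;
use warnings;

# Classify a compiled pattern by which probe strings it accepts.
# The helper name and the labels are made up for this sketch.
sub classify {
    my ($re) = @_;
    my $c_repeats   = ( 'abccc'  =~ $re ) ? 1 : 0;   # accepted if + acts on "c"
    my $abc_repeats = ( 'abcabc' =~ $re ) ? 1 : 0;   # accepted if + acts on "abc"
    return 'quantifies c'       if  $c_repeats && !$abc_repeats;
    return 'quantifies abc'     if !$c_repeats &&  $abc_repeats;
    return 'quantifies nothing' if !$c_repeats && !$abc_repeats;
    return 'accepts both probes';
}

my @cases = (
    [ 'abc(?#comment)+' => qr/^abc(?#comment)+$/ ],   # the pattern in question
    [ 'abc+'            => qr/^abc+$/            ],   # reference: + on "c"
    [ '(?:abc)+'        => qr/^(?:abc)+$/        ],   # reference: + on "abc"
);

printf "%-18s %s\n", $_->[0], classify( $_->[1] ) for @cases;
```

The two reference patterns pin down what each label means. Whichever label the first line prints shows how the running perl parses the quantifier after the comment.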

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

(tye)Re: Yet another regex bug.
by tye (Sage) on Nov 12, 2002 at 05:32 UTC

    I disagree that the + should modify the comment. The comment should behave the same whether it is written (?#comment) or just as #comment\n when using /x. And I don't think anyone would argue that a + at the start of the next line should modify the #comment\n at the end of the previous line. Consider:

    while( <DATA> ) {
        print;
        if( m/^([a-z_]\w+)=(\d{1,9}|[a-z_]\w*)$/ ) {
            print "($1)($2)\n";
        }
        my( $key, $value );
        ( $key, $value )= m{
            ^
            (
                [a-z_]  # Key name must start with a letter or '_'
                \w      # Subsequent characters can also be digits
                +       # Key names must be at least 2 characters
            )
            =
            (           # The value can be:
                \d      # If it starts with a digit, it is an integer
                {1,9}   # Only up to 9-digit values are allowed.
            |           # or, the value can be:
                [a-z_]  # An identifier must start with a letter or '_'
                \w      # Subsequent characters can also be digits
                *       # No length limit on IDs
            )
            $
        }x && do {
            print "# ($1)($2)\n";
            print "# ($key)($value)\n";
        };
        ( $key, $value )= m{
            ^
            (
                [a-z_]  (?# Key name must start with a letter or '_')
                \w      (?# Subsequent characters can also be digits)
                +       (?# Key names must be at least 2 characters)
            )
            =
            (           (?# The value can be:)
                \d      (?# If it starts with a digit, it is an integer)
                {1,9}   (?# Only up to 9-digit values are allowed.)
            |           (?# or, the value can be:)
                [a-z_]  (?# An identifier must start with a letter or '_')
                \w      (?# Subsequent characters can also be digits)
                *       (?# No length limit on IDs)
            )
            $
        }x && do {
            print "(?# ($1)($2) )\n";
            print "(?# ($key)($value) )\n";
        };
        print $/;
    }
    __END__
    this=that
    one=12

    which outputs:

    this=that
    (this)(that)
    # (this)(that)
    # ()()
    (?# (this)(that) )
    (?# (1)() )

    one=12
    (one)(12)
    # (one)(12)
    # ()()
    (?# (one)(12) )
    (?# (1)() )

    Note how $1 and $2 agree with me and only the return value from m// shows the indicated bug.
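A side note on the `()()` and `(1)()` lines: one possible reading is that they come from operator precedence rather than from the match. Assignment binds more loosely than `&&`, so `( $key, $value )= m{...}x && do {...}` assigns the value of the `do` block (the return value of its last `print`) to the list. The sketch below illustrates that reading under stated assumptions; the variable names and the simplified pattern are not from the post.

```perl
use strict;
use warnings;

my $line = 'this=that';

# Shape used in the post: this parses as  LIST = ( MATCH && do { ... } ),
# so the list receives the value of the do block, not the captures.
my ( $k1, $v1 ) = $line =~ /^(\w+)=(\w+)$/ && do { print "matched\n" };

# Parenthesised variant: the captures are assigned before && is evaluated.
my ( $k2, $v2 );
( ( $k2, $v2 ) = $line =~ /^(\w+)=(\w+)$/ ) && do { print "matched\n" };

# Undefined values are shown as empty strings.
printf "unparenthesised: (%s)(%s)\n", $k1 // '', $v1 // '';
printf "parenthesised:   (%s)(%s)\n", $k2 // '', $v2 // '';
```

In the first form `$k1` holds whatever `print` returned and `$v1` stays undefined; in the second, both variables hold the captured text.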

            - tye
Re: Yet another regex bug.
by VSarkiss (Monsignor) on Nov 11, 2002 at 22:55 UTC

    I would drive this from the principle that removing the comment should make no difference to what the program does. That is, your example should be exactly the same as /abc+/; it should match one or more c. If it's acting the same as /(?:abc)+/, I'd call it a bug.

      That implies that "removing the comment" is the translation...

      /abc(?#comment)+/  ---->  /abc+/   (first posted, in error, as /abc(?#comment)+/)
      

      Whereas I (and evidently at least 2 other people) expect it to be the translation...

      /abc(?#comment)+/  ------> /abc()+/
      

      Updated: forgot to actually make the translation I was trying to show.

        Well, your first line is the identity transformation, which isn't removing anything. ;-)

        The parentheses surrounding the ?# are part of the syntax; that is, the comment marker in a regexp begins (?#, not ?#. (Check perlre: all the "funny" extended-pattern elements start with (?, one of the reasons being that it's a mnemonic to "question" what's coming next. [1]) Thus if you're removing the comment you should remove the parentheses as well.

        [1] I don't buy the explanation, by the way, but it's there.
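The "removing the comment should make no difference" principle can be checked mechanically: strip each `(?#...)` group, parentheses included, and compare the two patterns on a few probe strings. This is a rough sketch only. `strip_inline_comments` is a hypothetical helper, and its naive substitution would also hit an escaped `\(?#` or one inside a character class.

```perl
use strict;
use warnings;

# Remove every (?#...) comment from a pattern source string.
# Per perlre, such a comment ends at the first ")".
# Hypothetical helper for this sketch; it does not handle escaped
# parentheses or character classes.
sub strip_inline_comments {
    my ($src) = @_;
    $src =~ s/\(\?#[^)]*\)//g;
    return $src;
}

my $with    = 'abc(?#comment)+';
my $without = strip_inline_comments($with);    # the parentheses go too

print "stripped pattern: $without\n";

# Compare both patterns on strings that separate the interpretations.
for my $probe (qw( abc abccc abcabc )) {
    my $r_with    = $probe =~ /\A$with\z/    ? 'match' : 'no match';
    my $r_without = $probe =~ /\A$without\z/ ? 'match' : 'no match';
    printf "%-8s with comment: %-9s without: %s\n",
        $probe, $r_with, $r_without;
}
```

If the two columns disagree on any row, the comment is changing what the pattern matches on that perl.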

Re: Yet another regex bug.
by particle (Vicar) on Nov 11, 2002 at 22:36 UTC

    i agree, the current behaviour is a bug.

    how would you think it should parse?

    i say the '+' should modify the comment, however pointless that may be. modifying the 'c' doesn't seem clear; the modifier should be adjacent to the token on which it acts. dangling modifiers should not be introduced to the already complex pattern matching syntax.

    Update: i've changed my tune. as i thought about it more yesterday, i realized my first impression was incorrect. the '(?#)' construct should act the same as a '#' comment in a pattern match with the 'x' modifier. tye is right on, in his response below ((tye)Re: Yet another regex bug..)

    ~Particle *accelerates*

      I'd say that it should be a warning, or a do-nothing. The + modifies the previous assertion. In this case, the previous assertion is (?#...), which asserts nothing about the stream. Asserting nothing a bunch of times should have the same effect as not asserting anything at all, or asserting nothing once -- no effect. Of course, asserting nothing more than once probably isn't what you meant, but there's no way of telling what you did mean, so we should warn.


      Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

Re: Yet another regex bug.
by elusion (Curate) on Nov 11, 2002 at 23:29 UTC
    I'm surprised that the comments aren't removed at compile time, or at least removed as the first step in a match. It would seem that following that path would efficiently eliminate all such problems.

    elusion : http://matt.diephouse.com

Re: Yet another regex bug.
by sauoq (Abbot) on Nov 11, 2002 at 23:11 UTC

    I also agree that it's a bug. Good catch.

    -sauoq
    "My two cents aren't worth a dime.";