PerlMonks  

A regexp to parse nested brackets containing strings

by dfaure (Chaplain)
on Sep 23, 2006 at 15:28 UTC

dfaure has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

While developing a custom formula interpreter, I'm using a recursive regexp inspired by the one here to match innermost bracketed blocks first. My problem comes when trying to deal with quoted strings containing bracketed text (which should obviously be treated as regular arguments):

    my $textInner = '(outer(inner(most "this (shouldn\'t match)" inner)))';
    my $innerRe;
    $innerRe = qr/
        \(                      # Start with '(',
        (                       # Start capture
            (?>[^()]+)          # Non-parenthesis
        |   (?> "[^"]*")+       # !!! don't work...
        |   (??{ $innerRe })    # Or a balanced () block
        )                       # One time only, aka the inner one
        \)                      # Ending with ')'
    /x;

    $textInner =~ /$innerRe/gs;
    print "inner: $1\n";
    __END__
    inner: shouldn't match

Any hints on this would be appreciated.

____
HTH, Dominique
My two favorites:
If the only tool you have is a hammer, you will see every problem as a nail. --Abraham Maslow
Bien faire, et le faire savoir...

Replies are listed 'Best First'.
Re: A regexp to parse nested brackets containing strings
by davido (Cardinal) on Sep 23, 2006 at 15:50 UTC

    My problem comes at trying to deal with quoted strings containing bracketed text...

    That's pretty much where one always runs into trouble when using a regular expression to parse balanced text. Now that the (?{...}) and (??{....}) directives exist, it's possible, but messy. This is discussed in perlfaq6 under the section addressing balanced text.

    This is why mankind invented Text::Balanced. As the FAQ states, you might also find helpful clues in Regexp::Common.
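
    As a rough illustration of the Text::Balanced route, here is a minimal sketch (not code from the thread). It assumes that `extract_bracketed` is given the delimiter string `'("'`, so that parentheses inside double-quoted strings are ignored, and it peels one level per call until no bracketed block is left. The helper name `innermost_block` and the prefix pattern are made up for this example, and it only follows the first nested block at each level.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: returns the contents of the innermost block
# reached by always descending into the first bracketed block found.
sub innermost_block {
    my ($text) = @_;
    my $found;

    # Prefix to skip before the opening paren: plain characters
    # and complete double-quoted strings.
    my $prefix = qr/(?:[^("]|"[^"]*")*/;

    while (1) {
        # '("' means: match (...) pairs, ignore brackets inside "..."
        my ($block) = extract_bracketed($text, '("', $prefix);
        last unless defined $block && length $block;

        $found = substr($block, 1, -1);   # strip the enclosing parens
        $text  = $found;                  # look inside on the next pass
    }
    return $found;
}

my $textInner = '(outer(inner(most "this (shouldn\'t match)" inner)))';
my $inner = innermost_block($textInner);
print "inner: $inner\n" if defined $inner;
```

    Only the double quote is listed as a quote character here, because the sample text contains an apostrophe that is not a string delimiter.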


    Dave

Re: A regexp to parse nested brackets containing strings
by ikegami (Patriarch) on Sep 23, 2006 at 16:21 UTC
    • (...) should be ( (?:...)* ).
    • [^()]+ should be [^()"]+.
    • $1 will contain the outer match, not the inner one.
    • The s modifier is useless since you don't have any . in that regexp.
    • The s modifier would also be useless on the qr//, which is not affected by the existing s.
    • It wouldn't hurt to anchor the regexp to the start of the string.
        use strict;
        use warnings;

        my $textInner = '(outer(inner(most "this (shouldn\'t match)" inner)))';

        my $innerRe;
        $innerRe = qr/
            \(
            (
                (?:
                    [^()"]+
                |
                    "[^"]*"
                |
                    (??{ $innerRe })
                )*
            )
            \)
        /sx;

        $textInner =~ /^$innerRe/g;
        print "outer: $1\n";

    (Sorry, I took out the comments to debug. Feel free to re-add them.)
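
    For readers on Perl 5.10 or later (released after this thread), the same idea can probably be written without `(??{ ... })` by using a named group and `(?&name)` recursion. This is a sketch under that assumption, not part of the original answer; the group name `block` is arbitrary.

```perl
use strict;
use warnings;
use 5.010;    # named captures and (?&name) recursion need Perl 5.10+

my $text = '(outer(inner(most "this (shouldn\'t match)" inner)))';

my $balanced = qr/
    (?<block>               # one balanced block, parens included
        \(
        (?:
            [^()"]++        # plain text: no parens, no quotes
        |   "[^"]*"         # a quoted string; parens inside are ignored
        |   (?&block)       # a nested balanced block
        )*
        \)
    )
/x;

my $outer;
if ($text =~ /^$balanced/) {
    $outer = $+{block};     # the outermost block, as in the (??{ }) version
    print "outer: $outer\n";
}
```

    As with the `(??{ ... })` version, the capture holds the outermost block, so the innermost one still has to be extracted separately.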

Re: A regexp to parse nested brackets containing strings
by rodion (Chaplain) on Sep 23, 2006 at 20:04 UTC
    davido has good advice with Text::Balanced. If you don't use that, at least get away from a pure regex solution and break things apart with Perl code managing the regexes. That way you can see what's going on as you pick things apart.

    If you're like me, it's also tough to just leave behind the ideas of "it should have worked", and "why didn't it". Here's my shot at helping with that:

    The $innerRe doesn't have a "^" in it to anchor it at the beginning of the string. That means it can skip over any characters it comes across, including open parens and quotes, until it finds something it wants, like the open paren inside the quotes. Then all it has to do is find a matching closing paren and it's done. That's why the part of $innerRe that handles embedded quotes isn't doing its job: it never comes into play.
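
    The skipping described above can be made visible by printing where the match starts. The sketch below reuses the regexp from the question (renamed `$orig_re` here, comments removed) and reads the start offset from `$-[0]`; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = '(outer(inner(most "this (shouldn\'t match)" inner)))';

# The regexp from the question, unchanged apart from the name.
my $orig_re;
$orig_re = qr/
    \(
    (
        (?>[^()]+)
    |   (?> "[^"]*")+
    |   (??{ $orig_re })
    )
    \)
/x;

my ($match_start, $captured);
if ($text =~ $orig_re) {
    # $-[0] is the offset at which the whole match began.
    ($match_start, $captured) = ($-[0], $1);
    print "match starts at offset $match_start, captured: $captured\n";
}

# Anchoring the top-level match rules out the skipping, but the
# original pattern then has nothing it can match at offset 0.
my $anchored_ok = ($text =~ /^$orig_re/) ? 1 : 0;
print "anchored match ", ($anchored_ok ? "succeeded" : "failed"), "\n";
```

    A start offset greater than zero means the engine skipped the outer parentheses and the opening quote before it found something to match.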

    That leaves a follow-on question. Shouldn't we just anchor the beginning of $innerRe with a "^" followed by "[^()]*" to account for non-paren text? I tried that, but $innerRe is also used recursively later on in the match, and putting "^" at the beginning of $innerRe means you effectively have two occurrences of "^", one at the beginning and another later in the regex. So you can't make $innerRe anchored at the beginning, and not anchored at the beginning, all in the same match invocation. You need to do a match, pull the resulting string out, then match that, so that "^" then refers to the string bound to the second match.

    And that's what Text::Balanced is for. It will also help handle the case where you've got a double quoted string, where that string contains an escaped double quote. That can be rough to handle in a regex that's trying to do everything else.
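
    If you do stay with a regex, one common way to cope with escaped double quotes is to let the quoted-string alternative accept a backslash followed by any character. The following is a sketch on top of the `(?: ... )*` structure from ikegami's reply; `$quoted`, `$block` and the sample text are made up for this example.

```perl
use strict;
use warnings;

# A double-quoted string in which \" (or any backslash escape) is allowed.
my $quoted = qr/"(?:[^"\\]|\\.)*"/s;

my $block;
$block = qr/
    \(
    (
        (?:
            [^()"]+             # plain text outside quotes
        |   $quoted             # a quoted string, escapes included
        |   (??{ $block })      # a nested balanced block
        )*
    )
    \)
/x;

# Single-quoted Perl string, so \" reaches the regex as backslash + quote.
my $sample = '(a "x \" ) y" (b))';

my $inside;
if ($sample =~ /^$block/) {
    $inside = $1;
    print "inside: $inside\n";
}
```

    With a plain `"[^"]*"` alternative the escaped quote would end the string early and the following `)` would be read as a closing bracket.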

    Update: Added escapes for square brackets (thanks to Fletch). Added "in it" to second paragraph to be clearer (after seeing ikegami's reading of it).

      So you can't make $innerRe anchored at the beginning, and not anchored at the beginning, all in the same match invocation.

      It's not a problem. My solution even does this.

          $\ = "\n";

          my $re;
          $re = qr/ a (??{ $re }) c | b /x;

          print 'aabcc'  =~ /^$re/ ||0;  # 1
          print '!aabcc' =~ /^$re/ ||0;  # 0 First call is anchored to start.
          print 'a!abcc' =~ /^$re/ ||0;  # 0 Recursive call is anchored to pos.
          print 'aa!bcc' =~ /^$re/ ||0;  # 0 Recursive call is anchored to pos.
        Looks like I wasn't specific enough in my writeup. Yes, you can put a "^" anchor in the top-level invocation of $re, but not in the $re itself. For the OP's original problem, where he is looking for the innermost parentheses that are not inside a quote, you want the "^" in the $re. If you don't put it there, the invocation of $re from within the first invocation of $re can just skip over text to find the parentheses within the quotes. However, "^" doesn't do what you need there, because it doesn't mean the beginning of where $re was invoked in each recursive iteration; it means the beginning of this invocation of the regex parser. At least that's what it looks like from the behavior.
Re: A regexp to parse nested brackets containing strings
by Anonymous Monk on Sep 26, 2006 at 08:45 UTC
    Here is a solution using the (?{}) code block evaluation, not nice but working:
        use strict;
        use warnings;

        my $textInner = '(outer(inner(most "this (shouldn\'t match)" inner)))';

        my $innerRe;
        my $idx = 0;
        my (@match);
        $innerRe = qr/
            \(
            (
                (?:
                    [^()"]+
                |
                    "[^"]*"
                |
                    (??{$innerRe})
                )*
            )
            \)(?{$match[$idx++]=$1;})
        /sx;

        $textInner =~ /^$innerRe/g;
        print "inner: $match[0]\n";
Re: A regexp to parse nested brackets containing strings
by Anonymous Monk on Sep 25, 2006 at 12:06 UTC
    Regular expressions strictly cannot parse recursively. A recursive descent or state machine parser (either of which might of course use regular expressions) is what is needed.
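
    To make the state-machine suggestion concrete, here is a minimal sketch of a scanner that walks the text with `\G` matches, skips double-quoted strings, and keeps a stack of open parentheses. It assumes that "innermost" means a block with no nested block inside it; the function name `innermost_blocks` is invented for this example, and escaped quotes are not handled.

```perl
use strict;
use warnings;

# Returns the contents of every (...) block that has no nested block.
sub innermost_blocks {
    my ($text) = @_;
    my @stack;     # one frame per open paren: [content start, has nested block?]
    my @blocks;

    pos($text) = 0;
    while (pos($text) < length $text) {
        if ($text =~ /\G"[^"]*"/gc) {
            next;                                   # quoted string: skip it whole
        }
        elsif ($text =~ /\G\(/gc) {
            $stack[-1][1] = 1 if @stack;            # the parent now has a child
            push @stack, [pos($text), 0];
        }
        elsif ($text =~ /\G\)/gc) {
            my $frame = pop @stack
                or die "unbalanced ')' at offset " . (pos($text) - 1) . "\n";
            my ($start, $has_child) = @$frame;
            push @blocks, substr($text, $start, pos($text) - 1 - $start)
                unless $has_child;
        }
        else {
            $text =~ /\G[^()"]+|\G./gcs;            # plain text or a stray quote
        }
    }
    die "unbalanced '('\n" if @stack;
    return @blocks;
}

my $textInner = '(outer(inner(most "this (shouldn\'t match)" inner)))';
print "inner: $_\n" for innermost_blocks($textInner);
```

    Each regex here only recognises one token; the nesting is tracked by the stack in plain Perl, which keeps every step easy to inspect.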

Node Type: perlquestion [id://574517]
Approved by McDarren
Front-paged by diotalevi