Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Recursive regular expression weirdness

by johngg (Canon)
on Mar 29, 2006 at 22:47 UTC ( [id://540073]=perlquestion: print w/replies, xml ) Need Help??

johngg has asked for the wisdom of the Perl Monks concerning the following question:

The question Ido posed, A (non) reg-ex question, and particularly the responses from hv and tilly where they applied recursive regular expressions to the problem piqued my interest. I decided to have a tinker with the example given in the Camel book, 3rd edition pp. 214 which is used to find balanced bracket pairs in a string. I thought I would try to enhance the regular expression with memory groups to see if I could pull out the brackets. So, you see, the pain I have suffered is self-inflicted:-)

I first created some test data and replicated the example in the book in a simple script to test success of failure of the match. Here is the script

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our $rxNest; $rxNest = qr {(?x) \( (?: (?>[^()]+) | (??{$rxNest}) )* \) }; testString($_) for @toTest; sub testString { my $string = shift; print "\n$string\n"; print " Match ", $string =~ /$rxNest/ ? "succeeded\n" : "failed\n"; }

and here is the output

Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match succeeded No brackets in this string Match failed Won)ky br(ackets in) this s(tring Match succeeded More wonky br(ackets in) th)is s(tring Match succeeded Just the one( leading bracket Match failed And just th)e one trailing bracket Match failed So(me m(ultip)le n(est(s in) thi)s o)ne Match succeeded Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match succeeded Some d((oub)le b)rackets Match succeeded ab(())cde Match succeeded ab(c(d)e Match succeeded ab(c)d)e Match succeeded

I had assumed without really analysing the regular expression that the match would fail if there were non-balanced brackets but that was obviously not the case. So that's another thing to try and add, I thought, match must fail if brackets don't balance.

Going first to the extraction of the brackets, I added a memory group

... our $rxNest; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) }; ...

in the expectation that I could access all of the memory groups created as the match recursed like this

... if($string =~ /$rxNest/) { no strict 'refs'; print " Match succeeded\n"; my $memNo = 1; while(defined ${$memNo}) { print " \$$memNo - ${$memNo}\n"; $memNo ++; } } else { print " Match failed\n"; } ...

That did not work. It seems to be a scoping issue with each recursion having it's own $1 with only the outer one visible to the code so I only ever saw everything between the outermost brackets.

As it seemed to do with scoping, I though I would be able to do my own memoizing by executing a bit of code in the regular expression when matching was successful. Like this

... our @memoList; our $rxNest; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) ... if($string =~ /$rxNest/) { print " Match succeeded\n"; print " $_\n" for @memoList; } else { print " Match failed\n"; } ...

This worked well, eg

Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded (ack) ( nested Br(ack)ets ) (ains balanced( nested Br(ack)ets )in t)

but showed up some weirdness when trying to match a string with one too many opening brackets

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c) (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c)

The regular expression seemed to go through the string twice so that I saw each nested pair twice. I imagine this is to do with backtracking too far in some way but I can't work out what is going on. I have tried to use re q(debug); on a shorter string with the same imbalance to spot what is going on but I don't really understand the output produced. Can any more savvy Monks shed any light on what is happening here?

Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. The trouble was, whatever I tried did not seem to work, either failing everything or being too lax again. It was only when I broke the expression into three parts and then amalgamated them that I had success. This works

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our @memoList; our ($rxBefore, $rxNest, $rxAfter, $rxWhole); $rxBefore = $rxAfter = qr{([^()]*)}; $rxNest = qr {(?x) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) }; $rxWhole = qr{^$rxBefore$rxNest$rxAfter$}; testString($_) for @toTest; sub testString { my $string = shift; @memoList = (); print "\nString: $string\n"; if($string =~ /$rxWhole/) { print " Match succeeded\n"; print " ---------------\n"; print " Before brackets:-\n"; print " $1\n"; print " Bracket pairs:-\n"; print " $_\n" for @memoList; print " After brackets:-\n"; print " $3\n"; } else { print " Match failed\n"; } }

and produces

String: Cont(ains balanced( nested Br(ack)ets )in t)he text Match succeeded --------------- Before brackets:- Cont Bracket pairs:- (ack) ( nested Br(ack)ets ) (ains balanced( nested Br(ack)ets )in t) After brackets:- he text String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match failed String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match failed String: No brackets in this string Match failed String: Won)ky br(ackets in) this s(tring Match failed String: More wonky br(ackets in) th)is s(tring Match failed String: Just the one( leading bracket Match failed String: And just th)e one trailing bracket Match failed String: So(me m(ultip)le n(est(s in) thi)s o)ne Match succeeded --------------- Before brackets:- So Bracket pairs:- (ultip) (s in) (est(s in) thi) (me m(ultip)le n(est(s in) thi)s o) After brackets:- ne String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match succeeded --------------- Before brackets:- Ther Bracket pairs:- (re) (e) ( mo(re) de(e)p ) (g i) (mul) (n(g i)n (mul)ti) (l) (ti(n(g i)n (mul)ti)p(l)es) (e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he) After brackets:- re String: Some d((oub)le b)rackets Match succeeded --------------- Before brackets:- Some d Bracket pairs:- (oub) ((oub)le b) After brackets:- rackets String: ab(())cde Match succeeded --------------- Before brackets:- ab Bracket pairs:- () (()) After brackets:- cde String: ab(c(d)e Match failed String: ab(c)d)e Match failed

However, if I put exactly the same regular expression syntax into one expression without breaking it down like this

#!/usr/bin/perl # use strict; use warnings; our @toTest = ( "Cont(ains balanced( nested Br(ack)ets )in t)he text", "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing", "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing", "No brackets in this string", "Won)ky br(ackets in) this s(tring", "More wonky br(ackets in) th)is s(tring", "Just the one( leading bracket", "And just th)e one trailing bracket", "So(me m(ultip)le n(est(s in) thi)s o)ne", "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re", "Some d((oub)le b)rackets", "ab(())cde", "ab(c(d)e", "ab(c)d)e"); our @memoList; our $rxNest; $rxNest = qr {(?x) ^ ([^()]*) ( \( (?: (?>[^()]+) | (??{$rxNest}) )* \) ) (?{push @memoList, $+}) ([^()]*) $ }; testString($_) for @toTest; sub testString { my $string = shift; @memoList = (); print "\nString: $string\n"; if($string =~ /$rxNest/) { print " Match succeeded\n"; print " ---------------\n"; print " Before brackets:-\n"; print " $1\n"; print " Bracket pairs:-\n"; print " $_\n" for @memoList; print " After brackets:-\n"; print " $3\n"; } else { print " Match failed\n"; } }

Nothing matches

String: Cont(ains balanced( nested Br(ack)ets )in t)he text Match failed String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match failed String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing Match failed String: No brackets in this string Match failed String: Won)ky br(ackets in) this s(tring Match failed String: More wonky br(ackets in) th)is s(tring Match failed String: Just the one( leading bracket Match failed String: And just th)e one trailing bracket Match failed String: So(me m(ultip)le n(est(s in) thi)s o)ne Match failed String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re Match failed String: Some d((oub)le b)rackets Match failed String: ab(())cde Match failed String: ab(c(d)e Match failed String: ab(c)d)e Match failed

I can't see any logical difference between the regular expression that is made up of three components and works and the one that is all in one piece and fails. What is going on?

I'm sorry that this post is so long but I am so puzzled. I must be missing something obvious but I can't see what. Please show me what I have overlooked, I'm sure I'll kick myself.

Cheers,

JohnGG

Replies are listed 'Best First'.
Re: Recursive regular expression weirdness
by diotalevi (Canon) on Mar 29, 2006 at 23:41 UTC

    You asked too many questions and your node is too long so I will respond to only one question.

    Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c) (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c)

    The engine started matching the initial (, successfully matched those three complete (...) strings, then found that it couldn't find a ) to complete the ( that was already in progress. It skipped to the next ( and tried again. Then it fully succeeded.

    If you wanted it to fail because there were unbalanced parens, you'd need to have asserted that there were to be no parens leading or following the balanced part. I thinkg that's what you said you were doing later down your node but like I said, it was too long and I didn't read the rest of it.

    Oh yeah, when you "put it all in one," you used ^ which asserted that you were at the beginning of the string. You can't write a regex that says $recurse = qr/^...(??{ $recurse })/ because when you recurse you aren't at the beginning of the string nymore.

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      Thank you for actually answering both questions. It's obvious now that you state it that anchoring and recursion don't go together like that. And I am kicking myself.

      Cheers,

      JohnGG

Re: Recursive regular expression weirdness
by ikegami (Patriarch) on Mar 30, 2006 at 01:00 UTC
    diotalevi provided the explanation of the problem. The following is a tested solution:
    ... our $rxNest; $rxNest = qr{(?x) ( \( [^()]* (?: (??{$rxNest}) [^()]* )* \) ) (?{ [ @{$^R}, $1 ] }) }; our $rxOnlyNested = qr{(?x) (?{ [] }) ^ [^()]* (?: $rxNest [^()]* )* \z (?{ @memoList = @{$^R} }) }; ... if ($string =~ $rxOnlyNested) ...

    Note that @memoList is now correct whether there is backtracking or not, thanks to the use of $^R.

    Note that I used $1 instead of $+. Using $+ will negatively impact on the performance of all regexps without captures executed in that perl interpreter instance, whereever they are located.

    Update: Removed unnecessary parens. (Copy and paste bug.)

      Thank you for providing a solution, which I think I understand. Please correct me if I am wrong.

      The (?{ [] }) at the start of $rxOnlyNested creates an empty list reference and the reference will be held in $^R (the result of the last code block). In $rxNest the (?{ [ @{$^R}, $1 ] }) at the end creates a new list reference which contains the dereferenced contents of the last list plus the memory group of the successful match. This list keeps growing as the regular expression recursed, in effect. Finally, in $rxOnlyNested the list is dereferenced and assigned to @memoList.

      I hope I have understood correctly and if it is doing what I think it is doing, it is a very neat solution. I am glad I have finally started visiting the Monastery as I am learning a lot of new techniques.

      Cheers,

      JohnGG

        That's exactly it.

        Now, you might wonder why I didn't add to @memoList directly. It might be best to use an example:

        local $_ = 'abbbbbc'; { local @memoList; / a (?: (b) (?{ push @memoList, $1 }) )* bbc /x; print(@memoList, "\n"); # bbbbb # Wrong! There should only be three! } { local @memoList; / (?{ [] }) a (?: (b) (?{ [ @{$^R}, $1 ] }) )* bbc (?{ @memoList = @{$^R} }) /x; print(@memoList, "\n"); # bbb } { local @temp; local @memoList; / # (?{ @temp = (); }) # Redundant with 'local @temp'. a (?: (b) (?{ local @temp = (@temp, $1); }) )* bbc (?{ @memoList = @temp }) /x; print(@memoList, "\n"); # bbb }

        When the regexp engine finds that it read too many 'b's — rememeber that * is greedy — it backtracks, "unreading" the last 'b' and eventually then a second last 'b'. $^R and @temp (since we keep using local to assign to @temp) are unwound when backtracking occurs, so the last assignment is undone with each "unreading". @memoList is not unwound, on the other hand, so it keeps holding the extra 'b's.

        A quick tests shows that using $^R is *much* faster than using local.

        Update: Removed unnecessary parens. (Copy and paste bug.)

Re: Recursive regular expression weirdness
by hv (Prior) on Mar 30, 2006 at 01:39 UTC

    Your first test (slightly reformatted) had:

    $rxNest = qr{(?x) \( (?: (?>[^()]+) | (??{$rxNest}) )* \) }; $string =~ /$rxNest/;

    Note that you can put regexp flags at the end of a qr() expression just as with a normal regexp, so this is the same:

    $rxNest = qr{ \( (?: (?>[^()]+) | (??{$rxNest}) )* \) }x;

    The regexp that is being recursively repeated is "find an open/close paren pair with valid nesting of any parens between". Since the match was unanchored, this will locate the first starting point that works; "contains (im(balanced) parens" would therefore match "(balanced)", for example.

    The (??{$rxNest}) is called a "deferred eval". When the main /$rxNest/ is compiled, this just appears as a code block in the compiled form - and the compiled form, among other things, needs to know how many capturing parens there are in the pattern. When the deferred eval is invoked the resulting regular expression is independent of the original one from which it was called. That means in particular that the deferred expression has its own capture groups numbering from $1, and these are not available to the parent expression when it returns.

    Your attempt to capture the nested strings with a code block was along the right lines, but to cope with backtracking you need to take advantage of the fact that local() will do the right thing. The easy solution is to localise the list:

    (?{ local @memoList = (@memoList, $+) })
    , but more efficient is to localise just one element at a time:
    (?{ local $offset = $offset + 1; local $memoList[$offset] = $+ })

    Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string.

    When using recursion, it is vital to understand what is the repeated part of the recursion. If you have anchors in the repeated part, it probably won't do what you want - it is equivalent to a regexp like m{^ text ^ more}x.

    So you need to take the anchors out of the repeated part, which is as simple as:

    $string =~ /^$rxNest\z/;

    Not sure if I covered all your points here. As diotalevi says, it would be better shorter - either using shorter examples, or splitting into multiple posts would be better.

    Hugo

      Thank you again for very clear explanations and suggestions which I will take away and try out. With luck this will not result in another post by way of a cry for help.

      I take your and diotalevi's point about the length of the post. As I was typing I kept thinking "this is getting too long." I should have split it in two. But I am also aware of the number of posts where not enough information is given and Monks are left guessing what the problem might be.

      Anyway, thanks again,

      JohnGG

Re: Recursive regular expression weirdness
by wind (Priest) on Mar 29, 2011 at 22:08 UTC

    I found your post while trying to solve one of my own problems at (??{ code }) versus (?PARNO) for recursive regular expressions. Anyway, in that process, I solved your original problem using both (??{ code }) and (?PARNO) which had not been introduced at the time of your question. I'm therefore just sharing them here in case anyone else happens by this thread.

    #!/usr/bin/perl use strict; use warnings; our $code_re; $code_re = qr{ \( (?: (?>[^()]+) | (??{$code_re}) )* \) }x; our $parno_re = qr{ ( \( (?: (?>[^()]+) | (?-1) )* \) ) }x; while (<DATA>) { chomp; print /^(?:(?>[^()]+)|$code_re)*$/ ? 'PASS' : ' '; print /^(?:(?>[^()]+)|$parno_re)*$/ ? ' PASS' : ' '; print " $_\n"; } __DATA__ + Cont(ains balanced( nested Br(ack)ets )in t)he text - Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing - Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing + No brackets in this string - Won)ky br(ackets in) this s(tring - More wonky br(ackets in) th)is s(tring - Just the one( leading bracket - And just th)e one trailing bracket + So(me m(ultip)le n(est(s in) thi)s o)ne + Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re + Some d((oub)le b)rackets + ab(())cde + ab()()cde - ab(c(d)e - ab(c)d)e
    Output
    PASS PASS + Cont(ains balanced( nested Br(ack)ets )in t)he text - Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing - Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing PASS PASS + No brackets in this string - Won)ky br(ackets in) this s(tring - More wonky br(ackets in) th)is s(tring - Just the one( leading bracket - And just th)e one trailing bracket PASS PASS + So(me m(ultip)le n(est(s in) thi)s o)ne PASS PASS + Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) h +e)re PASS PASS + Some d((oub)le b)rackets PASS PASS + ab(())cde PASS PASS + ab()()cde - ab(c(d)e - ab(c)d)e

    I also attempted to get the tracking of captured groups working like ikegami and others demonstrated. I can't think of a way to get it to work with (?PARNO). However, I did get a working solution using (??{ code }). The biggest problem I noticed with your doubling of results was that you didn't have enough (?> ) sections to avoid backtracking.

    our @matches; our $code_re; $code_re = qr{ ( \( (?: (?>[^()]+) | (??{$code_re}) )* \) ) (?{ push @matches, $1 }) }x; while (<DATA>) { chomp; @matches = (); print /^(?:(?>[^()]+)|$code_re)*$/ ? 'PASS' : (@matches = (), ' + '); print " $_\n"; print " $_\n" foreach @matches; }

    It might be an interesting challenge to get the results to print out aligned with where they're captured. But this was the limit of the goofing off I'm going to do for now :).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://540073]
Approved by Zaxo
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (5)
As of 2024-04-25 14:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found