Recursive regular expression weirdness

johngg has asked for the wisdom of the Perl Monks concerning the following question:

The question Ido posed, A (non) reg-ex question, and particularly the responses from hv and tilly where they applied recursive regular expressions to the problem piqued my interest. I decided to have a tinker with the example given in the Camel book, 3rd edition pp. 214 which is used to find balanced bracket pairs in a string. I thought I would try to enhance the regular expression with memory groups to see if I could pull out the brackets. So, you see, the pain I have suffered is self-inflicted:-)

I first created some test data and replicated the example in the book in a simple script to test success of failure of the match. Here is the script

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our $rxNest;
$rxNest = qr
   {(?x)
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
   };

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    print "\n$string\n";
    print "     Match ", $string =~ /$rxNest/ ?
       "succeeded\n" :
       "failed\n";
}
[download]

and here is the output

Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match succeeded

Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match succeeded

No brackets in this string
     Match failed

Won)ky br(ackets in) this s(tring
     Match succeeded

More wonky br(ackets in) th)is s(tring
     Match succeeded

Just the one( leading bracket
     Match failed

And just th)e one trailing bracket
     Match failed

So(me m(ultip)le n(est(s in) thi)s o)ne
     Match succeeded

Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match succeeded

Some d((oub)le b)rackets
     Match succeeded

ab(())cde
     Match succeeded

ab(c(d)e
     Match succeeded

ab(c)d)e
     Match succeeded
[download]

I had assumed without really analysing the regular expression that the match would fail if there were non-balanced brackets but that was obviously not the case. So that's another thing to try and add, I thought, match must fail if brackets don't balance.

Going first to the extraction of the brackets, I added a memory group

...
our $rxNest;
$rxNest = qr
   {(?x)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
   };
...
[download]

in the expectation that I could access all of the memory groups created as the match recursed like this

...
    if($string =~ /$rxNest/)
    {
        no strict 'refs';
        print "     Match succeeded\n";
        my $memNo = 1;
        while(defined ${$memNo})
        {
            print "     \$$memNo - ${$memNo}\n";
            $memNo ++;
        }
    }
    else
    {
        print "     Match failed\n";
    }
...
[download]

That did not work. It seems to be a scoping issue with each recursion having it's own $1 with only the outer one visible to the code so I only ever saw everything between the outermost brackets.

As it seemed to do with scoping, I though I would be able to do my own memoizing by executing a bit of code in the regular expression when matching was successful. Like this

...
our @memoList;
our $rxNest;
$rxNest = qr
   {(?x)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
...
    if($string =~ /$rxNest/)
    {
        print "     Match succeeded\n";
        print "     $_\n" for @memoList;
    }
    else
    {
        print "     Match failed\n";
    }
...
[download]

This worked well, eg

Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded
     (ack)
     ( nested Br(ack)ets )
     (ains balanced( nested Br(ack)ets )in t)
[download]

but showed up some weirdness when trying to match a string with one too many opening brackets

Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match succeeded
     (ack)
     (ced Br(ack)ets, )
     (mbalan(ced Br(ack)ets, )one c)
     (ack)
     (ced Br(ack)ets, )
     (mbalan(ced Br(ack)ets, )one c)
[download]

The regular expression seemed to go through the string twice so that I saw each nested pair twice. I imagine this is to do with backtracking too far in some way but I can't work out what is going on. I have tried to use re q(debug); on a shorter string with the same imbalance to spot what is going on but I don't really understand the output produced. Can any more savvy Monks shed any light on what is happening here?

Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. The trouble was, whatever I tried did not seem to work, either failing everything or being too lax again. It was only when I broke the expression into three parts and then amalgamated them that I had success. This works

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our @memoList;

our ($rxBefore, $rxNest, $rxAfter, $rxWhole);
$rxBefore = $rxAfter = qr{([^()]*)};
$rxNest = qr
   {(?x)
       (
       \(
       (?:
       (?>[^()]+)
       |
       (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
   };
$rxWhole = qr{^$rxBefore$rxNest$rxAfter$};

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    @memoList = ();
    print "\nString: $string\n";
    if($string =~ /$rxWhole/)
    {
        print "     Match succeeded\n";
        print "     ---------------\n";
        print "     Before brackets:-\n";
        print "          $1\n";
        print "     Bracket pairs:-\n";
        print "          $_\n" for @memoList;
        print "     After brackets:-\n";
        print "          $3\n";
    }
    else
    {
        print "     Match failed\n";
    }
}
[download]

and produces

String: Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match succeeded
     ---------------
     Before brackets:-
          Cont
     Bracket pairs:-
          (ack)
          ( nested Br(ack)ets )
          (ains balanced( nested Br(ack)ets )in t)
     After brackets:-
          he text

String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match failed

String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match failed

String: No brackets in this string
     Match failed

String: Won)ky br(ackets in) this s(tring
     Match failed

String: More wonky br(ackets in) th)is s(tring
     Match failed

String: Just the one( leading bracket
     Match failed

String: And just th)e one trailing bracket
     Match failed

String: So(me m(ultip)le n(est(s in) thi)s o)ne
     Match succeeded
     ---------------
     Before brackets:-
          So
     Bracket pairs:-
          (ultip)
          (s in)
          (est(s in) thi)
          (me m(ultip)le n(est(s in) thi)s o)
     After brackets:-
          ne

String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match succeeded
     ---------------
     Before brackets:-
          Ther
     Bracket pairs:-
          (re)
          (e)
          ( mo(re) de(e)p )
          (g i)
          (mul)

          (n(g i)n (mul)ti)
          (l)
          (ti(n(g i)n (mul)ti)p(l)es)
          (e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)
     After brackets:-
          re

String: Some d((oub)le b)rackets
     Match succeeded
     ---------------
     Before brackets:-
          Some d
     Bracket pairs:-
          (oub)
          ((oub)le b)
     After brackets:-
          rackets

String: ab(())cde
     Match succeeded
     ---------------
     Before brackets:-
          ab
     Bracket pairs:-
          ()
          (())
     After brackets:-
          cde

String: ab(c(d)e
     Match failed

String: ab(c)d)e
     Match failed
[download]

However, if I put exactly the same regular expression syntax into one expression without breaking it down like this

#!/usr/bin/perl
#
use strict;
use warnings;

our @toTest = (
   "Cont(ains balanced( nested Br(ack)ets )in t)he text",
   "Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing",
   "Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing",
   "No brackets in this string",
   "Won)ky br(ackets in) this s(tring",
   "More wonky br(ackets in) th)is s(tring",
   "Just the one( leading bracket",
   "And just th)e one trailing bracket",
   "So(me m(ultip)le n(est(s in) thi)s o)ne",
   "Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re",
   "Some d((oub)le b)rackets",
   "ab(())cde",
   "ab(c(d)e",
   "ab(c)d)e");

our @memoList;

our $rxNest;
$rxNest = qr
   {(?x)
       ^
       ([^()]*)
       (
       \(
       (?:
           (?>[^()]+)
       |
           (??{$rxNest})
       )*
       \)
       )
       (?{push @memoList, $+})
       ([^()]*)
       $
   };

testString($_) for @toTest;

sub testString
{
    my $string = shift;
    @memoList = ();
    print "\nString: $string\n";
    if($string =~ /$rxNest/)
    {
        print "     Match succeeded\n";
        print "     ---------------\n";
        print "     Before brackets:-\n";
        print "          $1\n";
        print "     Bracket pairs:-\n";
        print "          $_\n" for @memoList;
        print "     After brackets:-\n";
        print "          $3\n";
    }
    else
    {
        print "     Match failed\n";
    }
}
[download]

Nothing matches

String: Cont(ains balanced( nested Br(ack)ets )in t)he text
     Match failed

String: Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing
     Match failed

String: Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing
     Match failed

String: No brackets in this string
     Match failed

String: Won)ky br(ackets in) this s(tring
     Match failed

String: More wonky br(ackets in) th)is s(tring
     Match failed

String: Just the one( leading bracket
     Match failed

String: And just th)e one trailing bracket
     Match failed

String: So(me m(ultip)le n(est(s in) thi)s o)ne
     Match failed

String: Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re
     Match failed

String: Some d((oub)le b)rackets
     Match failed

String: ab(())cde
     Match failed

String: ab(c(d)e
     Match failed

String: ab(c)d)e
     Match failed
[download]

I can't see any logical difference between the regular expression that is made up of three components and works and the one that is all in one piece and fails. What is going on?

I'm sorry that this post is so long but I am so puzzled. I must be missing something obvious but I can't see what. Please show me what I have overlooked, I'm sure I'll kick myself.

Cheers,

JohnGG

Comment on Recursive regular expression weirdness Select or Download Code

Replies are listed 'Best First'.
Re: Recursive regular expression weirdness by diotalevi (Canon) on Mar 29, 2006 at 23:41 UTC
You asked too many questions and your node is too long so I will respond to only one question. `Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing Match succeeded (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c) (ack) (ced Br(ack)ets, ) (mbalan(ced Br(ack)ets, )one c)` [download] The engine started matching the initial (, successfully matched those three complete (...) strings, then found that it couldn't find a ) to complete the ( that was already in progress. It skipped to the next ( and tried again. Then it fully succeeded. If you wanted it to fail because there were unbalanced parens, you'd need to have asserted that there were to be no parens leading or following the balanced part. I thinkg that's what you said you were doing later down your node but like I said, it was too long and I didn't read the rest of it. Oh yeah, when you "put it all in one," you used ^ which asserted that you were at the beginning of the string. You can't write a regex that says `$recurse = qr/^...(??{ $recurse })/` because when you recurse you aren't at the beginning of the string nymore. ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊	[reply] [d/l] [select]
Re^2: Recursive regular expression weirdness by johngg (Canon) on Mar 30, 2006 at 07:08 UTC
Thank you for actually answering both questions. It's obvious now that you state it that anchoring and recursion don't go together like that. And I am kicking myself. Cheers, JohnGG	[reply]
Re: Recursive regular expression weirdness by ikegami (Patriarch) on Mar 30, 2006 at 01:00 UTC
diotalevi provided the explanation of the problem. The following is a tested solution: `... our $rxNest; $rxNest = qr{(?x) ( $ [^()]* (?: (??{$rxNest}) [^()]* )* $ ) (?{ [ @{$^R}, $1 ] }) }; our $rxOnlyNested = qr{(?x) (?{ [] }) ^ [^()]* (?: $rxNest [^()]* )* \z (?{ @memoList = @{$^R} }) }; ... if ($string =~ $rxOnlyNested) ...` [download] Note that `@memoList` is now correct whether there is backtracking or not, thanks to the use of `$^R`. Note that I used `$1` instead of `$+`. Using `$+` will negatively impact on the performance of all regexps without captures executed in that perl interpreter instance, whereever they are located. Update: Removed unnecessary parens. (Copy and paste bug.)	[reply] [d/l] [select]
Re^2: Recursive regular expression weirdness by johngg (Canon) on Mar 30, 2006 at 11:44 UTC
Thank you for providing a solution, which I think I understand. Please correct me if I am wrong. The `(?{ [] })` at the start of `$rxOnlyNested` creates an empty list reference and the reference will be held in `$^R` (the result of the last code block). In `$rxNest` the `(?{ [ @{$^R}, $1 ] })` at the end creates a new list reference which contains the dereferenced contents of the last list plus the memory group of the successful match. This list keeps growing as the regular expression recursed, in effect. Finally, in `$rxOnlyNested` the list is dereferenced and assigned to `@memoList`. I hope I have understood correctly and if it is doing what I think it is doing, it is a very neat solution. I am glad I have finally started visiting the Monastery as I am learning a lot of new techniques. Cheers, JohnGG	[reply] [d/l] [select]
Re^3: Recursive regular expression weirdness by ikegami (Patriarch) on Mar 30, 2006 at 15:41 UTC
That's exactly it. Now, you might wonder why I didn't add to @memoList directly. It might be best to use an example: `local $_ = 'abbbbbc'; { local @memoList; / a (?: (b) (?{ push @memoList, $1 }) )* bbc /x; print(@memoList, "\n"); # bbbbb # Wrong! There should only be three! } { local @memoList; / (?{ [] }) a (?: (b) (?{ [ @{$^R}, $1 ] }) )* bbc (?{ @memoList = @{$^R} }) /x; print(@memoList, "\n"); # bbb } { local @temp; local @memoList; / # (?{ @temp = (); }) # Redundant with 'local @temp'. a (?: (b) (?{ local @temp = (@temp, $1); }) )* bbc (?{ @memoList = @temp }) /x; print(@memoList, "\n"); # bbb }` [download] When the regexp engine finds that it read too many 'b's — rememeber that `` is greedy — it backtracks, "unreading" the last 'b' and eventually then a second last 'b'. `$^R` and `@temp` (since we keep using `local` to assign to `@temp`) are unwound when backtracking occurs, so the last assignment is undone with each "unreading". `@memoList` is not unwound, on the other hand, so it keeps holding the extra 'b's. A quick tests shows that using `$^R` is much* faster than using `local`. Update: Removed unnecessary parens. (Copy and paste bug.)	[reply] [d/l] [select]
Re^4: Recursive regular expression weirdness by johngg (Canon) on Mar 30, 2006 at 18:45 UTC
Re^5: Recursive regular expression weirdness by ikegami (Patriarch) on Mar 30, 2006 at 19:05 UTC
Some notes below your chosen depth have not been shown here
Re: Recursive regular expression weirdness by hv (Prior) on Mar 30, 2006 at 01:39 UTC
Your first test (slightly reformatted) had: `$rxNest = qr{(?x) $ (?: (?>[^()]+) \| (??{$rxNest}) )* $ }; $string =~ /$rxNest/;` [download] Note that you can put regexp flags at the end of a qr() expression just as with a normal regexp, so this is the same: `$rxNest = qr{ $ (?: (?>[^()]+) \| (??{$rxNest}) )* $ }x;` [download] The regexp that is being recursively repeated is "find an open/close paren pair with valid nesting of any parens between". Since the match was unanchored, this will locate the first starting point that works; `"contains (im(balanced) parens"` would therefore match `"(balanced)"`, for example. The `(??{$rxNest})` is called a "deferred eval". When the main `/$rxNest/` is compiled, this just appears as a code block in the compiled form - and the compiled form, among other things, needs to know how many capturing parens there are in the pattern. When the deferred eval is invoked the resulting regular expression is independent of the original one from which it was called. That means in particular that the deferred expression has its own capture groups numbering from `$1`, and these are not available to the parent expression when it returns. Your attempt to capture the nested strings with a code block was along the right lines, but to cope with backtracking you need to take advantage of the fact that `local()` will do the right thing. The easy solution is to localise the list: `(?{ local @memoList = (@memoList, $+) })` [download] , but more efficient is to localise just one element at a time: `(?{ local $offset = $offset + 1; local $memoList[$offset] = $+ })` [download] Going on to the second problem, which was to try and make the match fail if there were imbalanced brackets, I thought the best way would be to add stuff to anchor the match to the beginning and end of the string. When using recursion, it is vital to understand what is the repeated part of the recursion. If you have anchors in the repeated part, it probably won't do what you want - it is equivalent to a regexp like `m{^ text ^ more}x`. So you need to take the anchors out of the repeated part, which is as simple as: `$string =~ /^$rxNest\z/;` [download] Not sure if I covered all your points here. As diotalevi says, it would be better shorter - either using shorter examples, or splitting into multiple posts would be better. Hugo	[reply] [d/l] [select]
Re^2: Recursive regular expression weirdness by johngg (Canon) on Mar 30, 2006 at 11:14 UTC
Thank you again for very clear explanations and suggestions which I will take away and try out. With luck this will not result in another post by way of a cry for help. I take your and diotalevi's point about the length of the post. As I was typing I kept thinking "this is getting too long." I should have split it in two. But I am also aware of the number of posts where not enough information is given and Monks are left guessing what the problem might be. Anyway, thanks again, JohnGG	[reply]
Re: Recursive regular expression weirdness by wind (Priest) on Mar 29, 2011 at 22:08 UTC
I found your post while trying to solve one of my own problems at (??{ code }) versus (?PARNO) for recursive regular expressions. Anyway, in that process, I solved your original problem using both (??{ code }) and (?PARNO) which had not been introduced at the time of your question. I'm therefore just sharing them here in case anyone else happens by this thread. #!/usr/bin/perl use strict; use warnings; our $code_re; $code_re = qr{ $ (?: (?>[^()]+) \| (??{$code_re}) )* $ }x; our $parno_re = qr{ ( $ (?: (?>[^()]+) \| (?-1) )* $ ) }x; while (<DATA>) { chomp; print /^(?:(?>[^()]+)\|$code_re)$/ ? 'PASS' : ' '; print /^(?:(?>[^()]+)\|$parno_re)$/ ? ' PASS' : ' '; print " $_\n"; } __DATA__ + Cont(ains balanced( nested Br(ack)ets )in t)he text - Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing - Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing + No brackets in this string - Won)ky br(ackets in) this s(tring - More wonky br(ackets in) th)is s(tring - Just the one( leading bracket - And just th)e one trailing bracket + So(me m(ultip)le n(est(s in) thi)s o)ne + Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) he)re + Some d((oub)le b)rackets + ab(())cde + ab()()cde - ab(c(d)e - ab(c)d)e [download] Output PASS PASS + Cont(ains balanced( nested Br(ack)ets )in t)he text - Con(tains i(mbalan(ced Br(ack)ets, )one c)lose missing - Contains i(mbalan(ced Br(ack)ets, )one op)en m)missing PASS PASS + No brackets in this string - Won)ky br(ackets in) this s(tring - More wonky br(ackets in) th)is s(tring - Just the one( leading bracket - And just th)e one trailing bracket PASS PASS + So(me m(ultip)le n(est(s in) thi)s o)ne PASS PASS + Ther(e is( mo(re) de(e)p )nes(ti(n(g i)n (mul)ti)p(l)es) h +e)re PASS PASS + Some d((oub)le b)rackets PASS PASS + ab(())cde PASS PASS + ab()()cde - ab(c(d)e - ab(c)d)e [download] I also attempted to get the tracking of captured groups working like ikegami and others demonstrated. I can't think of a way to get it to work with (?PARNO). However, I did get a working solution using (??{ code }). The biggest problem I noticed with your doubling of results was that you didn't have enough (?> ) sections to avoid backtracking. `our @matches; our $code_re; $code_re = qr{ ( $ (?: (?>[^()]+) \| (??{$code_re}) )* $ ) (?{ push @matches, $1 }) }x; while (<DATA>) { chomp; @matches = (); print /^(?:(?>[^()]+)\|$code_re)*$/ ? 'PASS' : (@matches = (), ' + '); print " $_\n"; print " $_\n" foreach @matches; }` [download] It might be an interesting challenge to get the results to print out aligned with where they're captured. But this was the limit of the goofing off I'm going to do for now :).	[reply] [d/l] [select]


No such thing as a small change
	PerlMonks