PerlMonks  

How to find the outermost pair of square brackets with a regex?

by lokiloki (Beadle)
on Jan 17, 2007 at 00:53 UTC ( [id://595005]=perlquestion )

lokiloki has asked for the wisdom of the Perl Monks concerning the following question:

There is a string like the following:
blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah
Essentially, there is some text, and within that text there is text surrounded by square brackets. That enclosed text can itself contain additional square-bracket pairs. My question is: using s///, how can I find the text within the outermost brackets?

Replies are listed 'Best First'.
Re: How to find the outermost pair of square brackets with a regex?
by jettero (Monsignor) on Jan 17, 2007 at 01:00 UTC

    This seems to come up pretty regularly. The best answer seems to be Text::Balanced, which I recently bookmarked on my home node due to its awesomeness. Sadly, I can't remember who pointed it out to me.

    -Paul
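
A minimal sketch of the Text::Balanced approach, assuming the module's extract_bracketed function and a made-up sample string; the variable names are illustrative and not from the thread. The prefix pattern skips any text before the next opening bracket, so each call returns the next top-level group, and the outer delimiters are then stripped with substr.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical sample input with two top-level bracket groups.
my $text = 'aaa [b [c] [d e] f] ggg [h [i]] jjj';

my @groups;
my $rest = $text;
while (1) {
    # In list context the input string is left unmodified, so the
    # remainder is carried forward by hand. The third argument is the
    # prefix to skip: here, anything that is not an opening bracket.
    my ($found, $remainder) = extract_bracketed($rest, '[]', qr/[^\[]*/);
    last unless defined $found;            # no further balanced group
    push @groups, substr($found, 1, -1);   # drop the outer [ and ]
    $rest = $remainder;
}

print "$_\n" for @groups;
```

With this input the loop should collect the contents of the two top-level groups and stop at the trailing text. Unbalanced input makes extract_bracketed fail, which ends the loop early.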

Re: How to find the outermost pair of square brackets with a regex?
by ikegami (Patriarch) on Jan 17, 2007 at 01:02 UTC

    With difficulty.

    our $re;
    local $re = qr{
        \[
        (?:
            (?> [^\[\]]+ )
        |
            (??{ $re })
        )*
        \]
    }x;

    my $s = <<'__EOI__';
    blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah
    __EOI__

    $s =~ s/$re/moo/g;
    print($s);

    Partial credits to perlre.
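
For comparison, here is a sketch of the same idea that is not the code above: it swaps the (??{ $re }) construct for the recursive group syntax (?1), which assumes Perl 5.10 or later. The sample string and variable names are made up. Group 2 captures only the text between the outermost brackets, so it can be used in a match loop or in s///.

```perl
use strict;
use warnings;
use 5.010;

# Group 1 is one balanced [...] unit; (?1) recurses into group 1.
# Group 2 captures the text between the outermost brackets.
my $balanced = qr{
    (                       # group 1: a balanced bracket unit
        \[
        (                   # group 2: its contents
            (?: [^\[\]]++   # a run of non-bracket characters
              | (?1)        # or a nested balanced unit
            )*
        )
        \]
    )
}x;

# Hypothetical input with two top-level groups.
my $s = 'aa [b [c] [d e] f] gg [h [i]] jj';

# Collect the contents of every top-level group.
my @inner;
while ($s =~ /$balanced/g) {
    push @inner, $2;
}
print "$_\n" for @inner;

# The same pattern in a substitution, rewriting only the outermost pairs.
(my $copy = $s) =~ s/$balanced/<$2>/g;
print "$copy\n";
```

Because each match consumes a whole balanced unit, the nested pairs are never matched on their own, so only the top-level groups are reported or replaced.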

      that's beautiful... so, if i have this right... match the first bracket, IF, non-backtracking, we find an additional or , then recurse, match the last bracket? what is the purpose of the ^?
        sorry, i meant to say...

        that's beautiful...

        so, if i have this right... match the first bracket; IF, non-backtracking, we find an additional [ or ], then recurse; match the last bracket?

        what is the purpose of the ^?
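
On the last question: inside a character class, a leading ^ negates the class, so [^\[\]] matches any single character that is not a square bracket. A small sketch with a made-up string shows the difference:

```perl
use strict;
use warnings;

my $s = 'ab[cd]ef';

# [^\[\]]+ matches a run of characters that are NOT square brackets.
my @runs = $s =~ /([^\[\]]+)/g;      # ('ab', 'cd', 'ef')

# Without the ^ the class matches only the brackets themselves.
my @brackets = $s =~ /([\[\]])/g;    # ('[', ']')

print join('|', @runs), "\n";        # ab|cd|ef
print join('',  @brackets), "\n";    # []
```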

Re: How to find the outermost pair of square brackets with a regex?
by murugu (Curate) on Jan 17, 2007 at 07:54 UTC
Re: How to find the outermost pair of square brackets with a regex?
by Cody Pendant (Prior) on Jan 17, 2007 at 06:08 UTC
    I'm confused -- the text within the outermost brackets?

    Surely this will do it:

    $str = ' blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah';
    $str =~ m/\[(.*)\]/s;
    print $1;

    Because the regex engine finds the leftmost match, /s allows dot to match linebreaks, and dot-star is greedy, that's all you need. Am I missing something?



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
      Good idea, but this won't work because there may be multiple pairs of outermost (i.e., same-level) brackets, so a greedy regular expression will grab everything from the first opening bracket to the last closing one. I should have clarified this. For example:

      $str = ' blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah blah blah blah blah blah blah blah blah [blah [blah blah] [blah blah blah blah] blah] blah blah';

      In other words, there are multiple top-level bracket pairs, each of which may or may not contain additional pairs (and so on).
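
A short sketch of why neither a greedy nor a non-greedy dot-star handles this case, using a made-up string with two top-level groups:

```perl
use strict;
use warnings;

# Hypothetical input with two top-level groups.
my $str = 'x [a [b] c] y [d [e] f] z';

# Greedy .* runs from the first [ to the LAST ], swallowing both groups.
my ($greedy) = $str =~ /\[(.*)\]/s;
print "$greedy\n";    # a [b] c] y [d [e] f

# Non-greedy .*? stops at the FIRST ], cutting the nested group short.
my ($lazy) = $str =~ /\[(.*?)\]/s;
print "$lazy\n";      # a [b
```

Neither form counts nesting depth, which is why the recursive pattern or Text::Balanced is needed here.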

      btw, here is the code snippet that i finally used (which has some nuances that i didn't include in the original question), based on the previous suggestion:

      my $re = qr{\[(?:(?>[^\[\]]+)|(??{$re}))*\]}s;
      for (;;) {
          last unless $tempstr =~ s/(\[\w+?\s*=\s*($re|\n|[^\[\] +])+\])/&assign($1)/gies;
      }

        That code is broken.

        • You can't declare $re on the same line as you're using it. It won't even compile (under strict vars).

        • There's a reason I used a package variable instead of a lexical. It'll bite you if you interpolate into $re. Best to always use a package variable to avoid the problem entirely.

        • While not a bug per se, the s modifiers on both regexps are useless because neither pattern uses . (the dot).

        our $re;
        local $re = qr{\[(?:(?>[^\[\]]+)|(??{$re}))*\]};
        for (;;) {
            last unless $tempstr =~ s/(\[\w+?\s*=\s*($re|\n|[^\[\] +])+\])/assign($1)/gie;
        }
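
The point about the s modifier can be checked directly. This is a small sketch with a made-up two-line string: /s changes only what the dot matches, so a pattern that contains no dot behaves the same with or without it.

```perl
use strict;
use warnings;

my $text = "a\nb";

# Without /s, . does not match a newline.
my $plain  = $text =~ /a.b/  ? 1 : 0;   # 0
# With /s, . matches any character, newline included.
my $single = $text =~ /a.b/s ? 1 : 0;   # 1

# A pattern with no dot behaves identically with or without /s.
my $no_dot_plain = $text =~ /a\nb/  ? 1 : 0;   # 1
my $no_dot_s     = $text =~ /a\nb/s ? 1 : 0;   # 1

print "$plain $single $no_dot_plain $no_dot_s\n";   # 0 1 1 1
```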

        It's unfortunate that you modify things without knowing why they were done in the first place.

        I should point out that ikegami's suggestion by itself would surely have worked, but I couldn't get it to work for the nuances that I needed, so I added some possibly unnecessary cruft to account for my needs. I.e., I don't really know why |\n|[^\[\]+])+\] was necessary, but after many hours of trial and error, it was that which made everything (appear to) work.
