Recursive Regular Expression Help

me has asked for the wisdom of the Perl Monks concerning the following question:

I’m trying to write a regex that will capture a function in JavaScript - although the language in which the function is written really doesn’t matter.

I have this JavaScript code:

function firstFunc(){
  if(true){
    alert('testF');
  }
  elseif(false)
    alert('testf');
}
 
function submitme(){
  if(true){
    alert('test1');
  }
  elseif(false){
    alert('test2');
  }
  elseif(false){
    alert('test3');
  }
}
 
function submitmeAlso(){
  if(true){
    alert('test');
  }
  elseif(false)
    alert('test');
}

I’m trying to capture the entire “submitme()” function from the word “function” all the way to the closing curly bracket of just that function.

I’ve come across a recursive pattern that will capture a pair of outer brackets and all nested ones, but once you add the beginning text - the word ‘function’, the function name, and the opening and closing argument brackets - problems ensue.
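For reference, here is a minimal sketch of the kind of pattern described above, assuming Perl 5.10 or later. The pattern actually tried is not shown in the post, so the names `$js`, `$func_re`, and `body` are illustrative only. A frequent cause of trouble once a prefix is added is that `(?R)` recurses into the whole pattern, prefix included. Recursing into a named group that holds only the braces avoids that.

```perl
use strict;
use warnings;
use 5.010;    # recursive subpatterns such as (?&name) need Perl 5.10 or later

# Hypothetical input: a trimmed copy of the JavaScript from the question.
my $js = <<'END_JS';
function firstFunc(){
  if(true){
    alert('testF');
  }
}

function submitme(){
  if(true){
    alert('test1');
  }
  elseif(false){
    alert('test2');
  }
}

function submitmeAlso(){
  alert('test');
}
END_JS

# The prefix (keyword, name, argument list) sits outside the named group,
# so the recursion re-enters only the brace matcher, never the prefix.
my $func_re = qr/
    \b function \s+ submitme \s* \( [^)]* \) \s*
    (?<body>
        \{
        (?: [^{}]++ | (?&body) )*
        \}
    )
/x;

my ($match) = $js =~ /($func_re)/;
print defined $match ? "$match\n" : "no match\n";
```

This sketch counts every brace it sees, so a brace inside a string literal or a comment would throw the balance off.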

This may not be that hard, but I've had a heck of a time writing the right pattern.

Any help (an example pattern) would be greatly appreciated.


Re: Recursive Regular Expression Help by ikegami (Patriarch) on Apr 12, 2008 at 02:36 UTC
You could use Text::Balanced.
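A minimal sketch of how that suggestion might be applied, assuming the function header can be located with an ordinary regex first. The helper name `extract_function` and the sample input are hypothetical, not from the thread. `extract_bracketed` is exported by Text::Balanced and expects the text to start (after optional whitespace) at the opening bracket.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper, not from the thread: returns the source text of the
# named function, or undef if the header is missing or the braces don't balance.
sub extract_function {
    my ($js, $name) = @_;

    # Match the header and stop just before the opening brace.
    return
        unless $js =~ /(\bfunction\s+\Q$name\E\s*\([^)]*\)\s*)(?=\{)/;
    my $header = $1;

    # $+[0] is the offset where the match ended, i.e. at the opening brace.
    my $rest = substr($js, $+[0]);

    # Braces are the only bracket type; listing the quote characters makes
    # Text::Balanced skip over braces that occur inside quoted strings.
    my ($body) = extract_bracketed($rest, q/{}'"/);

    return unless defined $body && length $body;
    return $header . $body;
}

# Hypothetical input, modelled on the JavaScript in the question.
my $js = <<'END_JS';
function firstFunc(){
  alert('testF');
}

function submitme(){
  if(true){
    alert('test1');
  }
  elseif(false){
    alert('test2 }');
  }
}

function submitmeAlso(){
  alert('test');
}
END_JS

my $src = extract_function($js, 'submitme');
print defined $src ? "$src\n" : "not found\n";
```

In list context `extract_bracketed` leaves its input untouched and returns the extracted text first, which is why only the first element is kept here. Braces inside JavaScript comments or regex literals are not accounted for in this sketch.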
Re: Recursive Regular Expression Help by bart (Canon) on Apr 12, 2008 at 10:21 UTC
I'm not very comfortable with recursive patterns; they're new in Perl 5.10, and I doubt that PHP/PCRE supports them. In that case, I'd take a two-step approach, very much like the traditional lex/yacc approach, but simplified:

1. tokenize
2. parse (balance braces)

1. Tokenize

Using regular expressions, you can pull out the tokens: quoted strings, words, parens/braces/brackets, and other symbols. That way you will not accidentally mistake braces in quoted strings for syntactically meaningful braces.

Your regex engine needs to be able to continue matching where it left off last time. In Perl you use `//g` in scalar context; in JavaScript you can use `//g.exec(string)`. PCRE likely supports something like it in PHP, but I don't actually know.

The regex can look something like this (off the top of my head, not thoroughly tested):

    /\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S]/g

Note that I skip whitespace except newlines, which are meaningful in JavaScript, as they can terminate the current statement. Maybe (likely) you just don't care.

Here's some (Perl) code to test it with; load the JavaScript into $_ first:

    while (/(\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S])/g) {
        unless ($1 eq "\n") {
            print "Token: $1\n";
        }
        else {
            print "Newline\n";
        }
    }

I only display newlines differently because a bare newline as a token doesn't print so clearly.

2. Parsing – balancing braces

As you go through the tokens you extract one by one, you keep track of the nesting level: increment it if you encounter a bare "{", decrement it for a bare "}". As soon as it is decremented back to the same level as you started on for this function (usually 0, but it could be higher for nested functions), you have found its end. Here's the same code again, extended to keep track of the nesting level.
As I assume the JavaScript is syntactically valid, I just keep a common $level for every type of bracket; it's simpler this way.

    my $level = 0;
    while (/(\d[\w.]*|[\w\$]+|'(?:\\?.)*?'|"(?:\\?.)*?"|\/\*(?s:.*?)\*\/|\/\/.*|\/(?:\\?.)*?\/[a-z]*|\+\+|\-\-|[\n\S])/g) {
        if ($1 eq "\n") {
            print "Newline\n";
        }
        elsif (grep $1 eq $_, '(', '{', '[') {
            # opening bracket: report the level, then go one deeper
            print "Token: $1 level $level\n";
            $level++;
        }
        elsif (grep $1 eq $_, ')', '}', ']') {
            # closing bracket: come back up, then report the level
            $level--;
            print "Token: $1 level $level\n";
            # only a closing brace at level 0 ends a top level block
            print "Found the end of a top level block\n" if $level == 0 && $1 eq '}';
        }
        else {
            print "Token: $1\n";
        }
    }

This should suffice to get you started.

Update: I changed the way multiline comments (`/* ... */`) are handled: now the whole comment is one token. I don't know if JavaScript supports nested comments, but as it is, my code doesn't support them: it just searches for the next "`*/`". I found nothing on the internet about them, so I suppose that, if allowed, they are very rare.

I had forgotten about regexes. Added: they are handled the same way as quoted strings (with "`/`" as delimiter, and a backslash escaping anything), but with a possible suffix of lower case letters for the modifiers.
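The two steps above can be combined to pull out one named function. What follows is only a sketch under stated assumptions: the helper name `extract_function` and the sample input are hypothetical, the token pattern is the one from this reply written with /x, and only braces are counted (not a common level for all brackets), since the function body is delimited by braces alone.

```perl
use strict;
use warnings;

# The token pattern from the reply, spread out with /x.
my $token = qr{
      \d[\w.]*               # numbers
    | [\w\$]+                # identifiers and keywords
    | '(?:\\?.)*?'           # single-quoted strings
    | "(?:\\?.)*?"           # double-quoted strings
    | /\*(?s:.*?)\*/         # block comments
    | //.*                   # line comments
    | /(?:\\?.)*?/[a-z]*     # regex literals
    | \+\+ | --              # increment and decrement
    | [\n\S]                 # newline or any other single character
}x;

# Hypothetical helper, not from the thread: returns the source text of the
# named function, or undef if it is not found or its braces never balance.
sub extract_function {
    my ($js, $name) = @_;
    my ($start, $depth) = (undef, 0);
    my ($prev_tok, $prev_pos) = ('', 0);

    while ($js =~ /($token)/g) {
        my ($tok, $pos) = ($1, $-[0]);

        if (!defined $start) {
            # Still looking for "function NAME"; remember where "function" began.
            $start = $prev_pos if $prev_tok eq 'function' && $tok eq $name;
            ($prev_tok, $prev_pos) = ($tok, $pos);
        }
        elsif ($tok eq '{') {
            $depth++;
        }
        elsif ($tok eq '}' && --$depth == 0) {
            # Back at the level we started on: this brace closes the function.
            return substr($js, $start, $+[0] - $start);
        }
    }
    return;
}

# Hypothetical input with a brace inside a string and inside a comment.
my $js = <<'END_JS';
function firstFunc(){
  alert('testF');
}

function submitme(){
  if(true){
    alert('test1 }');   // stray { in a comment
  }
}

function submitmeAlso(){
  alert('test');
}
END_JS

my $src = extract_function($js, 'submitme');
print defined $src ? "$src\n" : "not found\n";
```

Because strings and comments arrive as single tokens, the braces inside them never reach the counter. The sketch expects the function name to follow `function` directly on the same line, and it shares the pattern's limitation that a `/` used for division can be mistaken for the start of a regex literal.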

