need to parse firts part of SQL-query (regex question)

jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: need to parse firts part of SQL-query (regex question) by mpeppler (Vicar) on Jan 17, 2008 at 15:05 UTC
To expand on roboticus's comments - there is a SQL::Tokenize (I think that's the name) module on CPAN which works pretty well at tokenizing a SQL query. You could use that as a starting point. Michael
Re^2: need to parse firts part of SQL-query (regex question) by Not_a_Number (Prior) on Jan 17, 2008 at 17:05 UTC
Do you mean SQL::Tokenizer? That doesn't seem to do what the OP wants:

```perl
use SQL::Tokenizer;

my $query  = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7};
my @tokens = SQL::Tokenizer->tokenize($query);
print join "\n", @tokens;
__END__
f1
,
f2
,
SUM
(
f3
)
,
CONCAT
(
f4
,
f5
,
f6
)
,
f7
```
Re^3: need to parse firts part of SQL-query (regex question) by grinder (Bishop) on Jan 17, 2008 at 17:22 UTC
"That doesn't seem to do what the OP wants"

Come now young man, where's your sense of adventure? With a bit of lookahead and a state machine you can easily massage the token stream into something useful:

```perl
use SQL::Tokenizer;

my $query = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), sum((f1+f2)f3)};
my @token = SQL::Tokenizer->tokenize($query);

my $paren_depth = 0;
my $cache       = '';
while (my $val = shift @token) {
    # look ahead: the next token opens a parenthesis
    if ($token[0] eq '(') {
        $paren_depth++;
    }
    if ($val eq ')') {
        $paren_depth--;
        if ($paren_depth == 0) {
            print $cache;
            $cache = '';
        }
    }
    if ($paren_depth) {
        $cache .= $val;
    }
    else {
        print "$val\n";
    }
}
__PRODUCES__
f1
,
f2
,
SUM(f3)
,
CONCAT(f4,f5, f6)
,
sum((f1+f2)f3)
```

That's not too shabby. The tokenizer does the heavy lifting, you just have to put the pieces back together again.

• another intruder with the mooring in the heart of the Perl
Re^4: need to parse firts part of SQL-query (regex question) by Not_a_Number (Prior) on Jan 17, 2008 at 17:37 UTC
Re^5: need to parse firts part of SQL-query (regex question) by grinder (Bishop) on Jan 17, 2008 at 18:12 UTC
Re^3: need to parse firts part of SQL-query (regex question) by mpeppler (Vicar) on Jan 18, 2008 at 08:11 UTC
Agreed - but this seems to me to be a good starting place. I used this to parse SQL source files for over a thousand stored procedures and check that the case of variables, columns, etc. matched when moving a system from a case-insensitive dataserver to a case-sensitive one. It required doing quite a bit of hand-coding to handle the various language elements of Transact-SQL, and ended up with about 700 lines of code to do all the checks, but I mostly got it done... Michael
Re^2: need to parse firts part of SQL-query (regex question) by roboticus (Chancellor) on Jan 17, 2008 at 16:32 UTC
mpeppler++ That looks like a *much* better suggestion than mine! ...roboticus
Re: need to parse firts part of SQL-query (regex question) by roboticus (Chancellor) on Jan 17, 2008 at 14:46 UTC
jeanluca:

Parsing can be tricky. Regexes aren't always the right answer. The main problem is that there are varying quoting rules and matching delimiter pairs that interact with each other. Coming up with a regex to handle it properly would be tricky. Suppose you answered your immediate question. Then you'd encounter an expression like: `f1, f2, substr('bizarro)',5,2)`

You might look over the Parse::RecDescent module to see if you can stand to use that.

...roboticus

Update: Regexes can be great for breaking the input stream into tokens, so perhaps you'll find it sufficient to tokenize your input, and then put some of the tokens back together to build the data statement you want. Once you tokenize your strings, you can look for mismatched things like parens, and glue some tokens back together to get the results you want. (Poor man's parsing?) This approach probably has its own problems though...
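To make the quoting problem concrete, here is a minimal sketch of a character-by-character splitter that tracks both parenthesis depth and quote state. It is not from the thread: the helper name `split_top_level` is hypothetical, and it assumes single-quoted SQL string literals in which a doubled `''` is the only escape.

```perl
use strict;
use warnings;

# Hypothetical helper: split a select list on top-level commas only.
# Assumes single-quoted string literals, with '' as the escaped quote.
sub split_top_level {
    my ($text) = @_;
    my @fields;
    my $current  = '';
    my $depth    = 0;    # current parenthesis nesting level
    my $in_quote = 0;    # true while inside a '...' literal

    for my $ch (split //, $text) {
        if ($in_quote) {
            # a quote closes the literal; a '' pair simply reopens it
            $in_quote = 0 if $ch eq "'";
        }
        elsif ($ch eq "'") { $in_quote = 1 }
        elsif ($ch eq '(') { $depth++ }
        elsif ($ch eq ')') { $depth-- }
        elsif ($ch eq ',' && $depth == 0) {
            # top-level comma: finish the current field
            push @fields, $current;
            $current = '';
            next;
        }
        $current .= $ch;
    }
    push @fields, $current;

    s/^\s+|\s+$//g for @fields;    # trim whitespace around each field
    return @fields;
}

print "$_\n" for split_top_level(q{f1, f2, substr('bizarro)',5,2)});
```

For the example above this prints `f1`, `f2` and `substr('bizarro)',5,2)` on separate lines, because the `)` inside the string literal is never counted. Other quoting styles (double-quoted identifiers, comments) would need additional states.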
Re: need to parse firts part of SQL-query (regex question) by almut (Canon) on Jan 17, 2008 at 16:32 UTC
Here's a (somewhat ugly) attempt using Text::Balanced. The idea is to extract the balanced parentheses fragments, in order to apply the split to the remaining parts of the string only.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

sub mysplit {
    my $text = shift;
    my @fields;
    do {
        my ($paren, $post, $pre) = extract_bracketed($text, '()', '[^(]*');
        my $s = '';
        if ($pre) {
            $s = $pre;
        }
        elsif (!$paren) {
            $s    = $post;
            $post = '';
        }
        $s =~ s/^,//;              # get rid of superfluous leading comma
        my @f = split /,/, $s;
        $f[-1] .= $paren if @f;    # append balanced parens part to last elem
        push @fields, @f;
        $text = $post;
    } while ($text);
    return @fields;
}

my $sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7";
print "$_\n" for mysplit($sql);
```

Output:

```
f1
f2
SUM(f3)
CONCAT(f4,f5, f6)
f7
```

(Not well tested. Also, I have that feeling there must be something more elegant than this mess of conditionals... but it's escaping me right now :)
Re: need to parse firts part of SQL-query (regex question) by johngg (Canon) on Jan 17, 2008 at 17:13 UTC
You can attack this from both ends of the string using split with a third argument limiting the number of resultant fields and reverse.

```perl
#!/usr/local/bin/perl -l
#
use strict;
use warnings;

my $sql = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7};

# split off the first three fields; the fourth element keeps the rest
my @fields = split m{\s*,\s*}, $sql, 4;

# split the remainder from the right by reversing it first
splice @fields, 3, 1,
    reverse
    map { $_ = reverse }
    split m{\s*,\s*}, reverse($fields[3]), 2;

print for @fields;
```

Here's the output.

```
f1
f2
SUM(f3)
CONCAT(f4,f5, f6)
f7
```

The use of spaces with commas was inconsistent so I decided to discard the spaces during the `split`.

Cheers, JohnGG
Re: need to parse firts part of SQL-query (regex question) by martin (Friar) on Jan 17, 2008 at 18:22 UTC
How about this: think positive and use a regex for the fields you want rather than for what might or might not separate them, like this:

```perl
my $part_sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7";
my @fields = $part_sql =~ m{
    (?: ^ | , )        # start of string or comma
    (                  # begin capture
        (?:            # either:
            [^,()]     # non-paren non-comma
        |              # or:
            \(         # left paren
            [^()]*     # any amount of non-paren
            \)         # right paren
        )*             # as often as possible
    )                  # end capture
}gx;                   # take all matches
print map { "$_\n" } @fields;
```

Of course, this simplistic approach does not handle nested parentheses.
Re^2: need to parse firts part of SQL-query (regex question) by bfdi533 (Friar) on Mar 27, 2008 at 21:32 UTC
Quite an interesting solution. Of course, as you have stated, it does not handle nested parentheses. Which, of course, is something that I need to do. Any takers on how this can be adapted to handle nested parens?
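One possible adaptation, offered as a sketch rather than a tested solution: Perl 5.10 and later support recursive patterns, so the parenthesised alternative in martin's regex can call itself. The sub name `split_fields`, the group name `paren` and the sample input are assumptions made for illustration; quoted parentheses and unbalanced input are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): martin's pattern with the
# parenthesised alternative made recursive. Requires Perl 5.10 or later.
sub split_fields {
    my ($sql) = @_;
    my @fields;
    while ($sql =~ m{
        (?: ^ | , )                # start of string or comma
        (                          # begin capture of one field
            (?:
                [^,()]             # non-paren non-comma
              |
                (?<paren>          # a balanced parenthesised chunk:
                    \(
                    (?:
                        [^()]++    # a run of non-paren characters
                      |
                        (?&paren)  # or a nested balanced chunk
                    )*
                    \)
                )
            )*
        )                          # end capture
    }gx) {
        push @fields, $1;
    }
    s/^\s+|\s+$//g for @fields;    # trim whitespace around each field
    return @fields;
}

print "$_\n" for split_fields("f1,f2, SUM(f3),CONCAT(f4,f5, f6), SUM((f1+f2)*f3)");
```

The loop uses a scalar-context match and pushes only the first capture, because a list-context `m//g` would also return the contents of the named `paren` group for every match.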
Re: need to parse firts part of SQL-query (regex question) by bart (Canon) on Jan 17, 2008 at 22:33 UTC
Take a look at the article Lexing Your Data by Ovid on perl.com, in which he uses HOP::Lexer to tokenize SQL statements. He did use a bit of a weird grammar, as a trick to circumvent some problems with that module, as I described in the thread HOP::Lexer not doing what I expected. But otherwise, it should fit your bill perfectly.
Re: need to parse firts part of SQL-query (regex question) by KurtSchwind (Chaplain) on Jan 17, 2008 at 16:28 UTC
This is a tricky bit of code. And you can use the tokenizer modules for SQL. However, another way is to take this in 2 steps. Using a temp variable for your SQL, you could actually rewrite the commas inside the CONCAT to something else. Split on commas. And then take the part with the CONCAT and turn the placeholders back into commas again. For example, if you know you never have '#' characters in your SQL, you could temporarily replace the commas in your CONCAT with #s.

-- I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
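The placeholder idea can be sketched roughly as follows. The helper names `mask` and `split_with_placeholder` are hypothetical, and the sketch assumes that `#` never occurs in the SQL and that parentheses are not nested.

```perl
use strict;
use warnings;

my $placeholder = '#';    # assumption: never appears in the SQL itself

# Hypothetical helper: replace the commas inside one (...) group.
sub mask {
    my ($group) = @_;
    $group =~ s/,/$placeholder/g;
    return $group;
}

# Hypothetical helper: mask, split on top-level commas, then unmask.
sub split_with_placeholder {
    my ($sql) = @_;

    # Step 1: hide the commas inside each non-nested (...) group
    (my $tmp = $sql) =~ s/(\([^()]*\))/mask($1)/ge;

    # Step 2: every remaining comma is a field separator
    my @fields = split /\s*,\s*/, $tmp;

    # Step 3: turn the placeholders back into commas
    s/\Q$placeholder\E/,/g for @fields;
    return @fields;
}

print "$_\n" for split_with_placeholder("f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7");
```

With the sample string this prints `f1`, `f2`, `SUM(f3)`, `CONCAT(f4,f5, f6)` and `f7`, one per line. Matching every parenthesised group rather than only CONCAT is a generalisation of the suggestion above, not part of it.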

