jeanluca has asked for the wisdom of the Perl Monks concerning the following question:
Dear Monks What started as a simple exercise, has now gone beyond my (regex) knowledge.
I need to split the following on ',': $part_sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7"
The problem is that I don't want to split the stuff inside the CONCAT.
Firts I thought that it was easy, like
my @fields = split(/(?<!\([^\)]+),/, $part_sql);
This raised the following error Variable length lookbehind not implemented in regex; marked by <-- HER
+E in m/(?<!\([^\)]+), <-- HERE / ....
Is there a regex that can do this, or should I do it just differently ?
Cheers LuCa
Update: from the answers I get the impression that a SQL-parser-module would be the best solution, but for now I choose the quick solution provided by martin. Thnx for all the suggestion!!
Re: need to parse firts part of SQL-query (regex question)
by mpeppler (Vicar) on Jan 17, 2008 at 15:05 UTC
|
To expand on roboticus's comments - there is a SQL::Tokenize (I think that's the name) module on CPAN which works pretty well at tokenizing a SQL query. You could use that as a starting point.
Michael
| [reply] |
|
use SQL::Tokenizer;
my $query = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7};
my @tokens = SQL::Tokenizer->tokenize($query);
print join "\n", @tokens;
__END__
f1
,
f2
,
SUM
(
f3
)
,
CONCAT
(
f4
,
f5
,
f6
)
,
f7
| [reply] [d/l] |
|
use SQL::Tokenizer;
my $query = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), sum((f1+f2)*f3)};
my @token = SQL::Tokenizer->tokenize($query);
my $paren_depth = 0;
my $cache = '';
while(my $val = shift @token) {
if ($token[0] eq '(') {
$paren_depth++;
}
if ($val eq ')') {
$paren_depth--;
if ($paren_depth == 0) {
print $cache;
$cache = '';
}
}
if ($paren_depth) {
$cache .= $val;
}
else {
print "$val\n";
}
}
__PRODUCES__
f1
,
f2
,
SUM(f3)
,
CONCAT(f4,f5, f6)
,
sum((f1+f2)*f3)
That's not too shabby. The tokenizer does the heavy lifting, you just have to put the pieces back together again.
• another intruder with the mooring in the heart of the Perl
| [reply] [d/l] |
|
|
|
Agreed - but this seems to me to be a good starting place.I used this to parse SQL source files for over a thousand stored procedures and check that the case of variables, columns, etc. matched when moving a system from a case-insensitive dataserver to a case-sensitive one. It required doing quite a bit of hand-coding to handle the various language elements of Transact-SQL, and ended up with about 700 lines of code to do all the checks, but I mostly got it done...
Michael
| [reply] |
|
| [reply] |
Re: need to parse firts part of SQL-query (regex question)
by roboticus (Chancellor) on Jan 17, 2008 at 14:46 UTC
|
jeanluca:
Parsing can be tricky. Regexes aren't always the right answer. The main problem is that there are varying quoting rules and matching delimiter pairs that interact with each other. Coming up with a regex to handle it properly would be tricky.
Suppose you answered your immediate question. Then you'd encounter an expression like: f1, f2, substr('bizarro)',5,2)
You might lookover the Parse::RecDescent module to see if you can stand to use that.
...roboticus
Update: Regexes can be great for breaking the input stream into tokens, so perhaps you'll find it sufficient to tokenize your input, and then put some of the tokens back together to build the data statement you want. Once you tokenize your strings, then you can look for mismatched things like parens, and glue some tokens back together to get the results you want. (Poor man's parsing?) This approach probably has its own problems though... | [reply] [d/l] |
Re: need to parse firts part of SQL-query (regex question)
by almut (Canon) on Jan 17, 2008 at 16:32 UTC
|
Here's a (somewhat ugly) attempt using Text::Balanced. The
idea is to extract the balanced parentheses fragments, in order to apply
the split to the remaining parts of the string only.
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
sub mysplit {
my $text = shift;
my @fields;
do {
my ($paren, $post, $pre) = extract_bracketed($text, '()', '[^(
+]*');
my $s = '';
if ($pre) {
$s = $pre;
} elsif (!$paren) {
$s = $post;
$post = '';
}
$s =~ s/^,//; # get rid of superfluous leading comm
+a
my @f = split /,/, $s;
$f[-1] .= $paren if @f; # append balanced parens part to last
+ elem
push @fields, @f;
$text = $post;
} while ($text);
return @fields;
}
my $sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7";
print "$_\n" for mysplit($sql);
Output:
f1
f2
SUM(f3)
CONCAT(f4,f5, f6)
f7
(Not well tested — also, I have that feeling there must be something more elegant than this
mess of conditionals... but it's escaping me right now :)
| [reply] [d/l] [select] |
Re: need to parse firts part of SQL-query (regex question)
by johngg (Canon) on Jan 17, 2008 at 17:13 UTC
|
You can attack this from both ends of the string using split with a third argument limiting the number of resultant fields and reverse.
#!/usr/local/bin/perl -l
#
use strict;
use warnings;
my $sql = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7};
my @fields = split m{\s*,\s*}, $sql, 4;
splice @fields, 3, 1,
reverse
map { $_ = reverse }
split m{\s*,\s*}, reverse($fields[3]), 2;
print for @fields;
Here's the output.
f1
f2
SUM(f3)
CONCAT(f4,f5, f6)
f7
The use of spaces with commas was inconsistent so I decided to discard the spaces during the split.
Cheers, JohnGG | [reply] [d/l] [select] |
Re: need to parse firts part of SQL-query (regex question)
by martin (Friar) on Jan 17, 2008 at 18:22 UTC
|
How about this: Think positive and use a regex for the fields you want rather than what might or might not separate them, like this:
my $part_sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7";
my @fields =
$part_sql =~ m{
(?: ^ | , ) # start of string or comma
( # begin capture
(?: # either:
[^,()] # non-paren non-comma
| # or:
\( # left paren
[^()]* # any amount of non-paren
\) # right paren
)* # as often as possible
) # end capture
}gx; # take all matches
print map { "$_\n" } @fields;
Of course, this simplistic approach does not handle nested parentheses.
| [reply] [d/l] |
|
Quite an interesting solution. Of course, as you have stated, it does not handle nested parentheses. Which, of course, is something that I need to do.
Any takers on how this can be adapted to handle nested parens?
| [reply] |
Re: need to parse firts part of SQL-query (regex question)
by bart (Canon) on Jan 17, 2008 at 22:33 UTC
|
| [reply] |
Re: need to parse firts part of SQL-query (regex question)
by KurtSchwind (Chaplain) on Jan 17, 2008 at 16:28 UTC
|
This is a tricky bit of code. And you can use the tokenizer modules for SQL.
However, another way is to take this in 2 steps. Using a temp variable for your sql, you could actually re-write the comas inside the concat to something else. Split on comas. And then take the part with the concat and return the values to comas again. For example, if you know you never have '#' characters in your sql, you could temporarily replace the comas in your concat to #s.
--
I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
| [reply] |
|
|