http://qs321.pair.com?node_id=662879

jeanluca has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

What started as a simple exercise, has now gone beyond my (regex) knowledge.
I need to split the following on ',':
$part_sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7"
The problem is that I don't want to split the stuff inside the CONCAT.
Firts I thought that it was easy, like
my @fields = split(/(?<!\([^\)]+),/, $part_sql);
This raised the following error
Variable length lookbehind not implemented in regex; marked by <-- HER +E in m/(?<!\([^\)]+), <-- HERE / ....
Is there a regex that can do this, or should I do it just differently ?

Cheers
LuCa

Update: from the answers I get the impression that a SQL-parser-module would be the best solution, but for now I choose the quick solution provided by martin.
Thnx for all the suggestion!!

Replies are listed 'Best First'.
Re: need to parse firts part of SQL-query (regex question)
by mpeppler (Vicar) on Jan 17, 2008 at 15:05 UTC
    To expand on roboticus's comments - there is a SQL::Tokenize (I think that's the name) module on CPAN which works pretty well at tokenizing a SQL query. You could use that as a starting point.

    Michael

      Do you mean SQL::Tokenizer? That doesn't seem to do what the OP wants:

      use SQL::Tokenizer; my $query = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7}; my @tokens = SQL::Tokenizer->tokenize($query); print join "\n", @tokens; __END__ f1 , f2 , SUM ( f3 ) , CONCAT ( f4 , f5 , f6 ) , f7
        That doesn't seem to do what the OP wants

        Come now young man, where's your sense of adventure? With a bit of lookahead and a state machine you can easily massage the token stream into something useful:

        use SQL::Tokenizer; my $query = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), sum((f1+f2)*f3)}; my @token = SQL::Tokenizer->tokenize($query); my $paren_depth = 0; my $cache = ''; while(my $val = shift @token) { if ($token[0] eq '(') { $paren_depth++; } if ($val eq ')') { $paren_depth--; if ($paren_depth == 0) { print $cache; $cache = ''; } } if ($paren_depth) { $cache .= $val; } else { print "$val\n"; } } __PRODUCES__ f1 , f2 , SUM(f3) , CONCAT(f4,f5, f6) , sum((f1+f2)*f3)

        That's not too shabby. The tokenizer does the heavy lifting, you just have to put the pieces back together again.

        • another intruder with the mooring in the heart of the Perl

        Agreed - but this seems to me to be a good starting place.

        I used this to parse SQL source files for over a thousand stored procedures and check that the case of variables, columns, etc. matched when moving a system from a case-insensitive dataserver to a case-sensitive one. It required doing quite a bit of hand-coding to handle the various language elements of Transact-SQL, and ended up with about 700 lines of code to do all the checks, but I mostly got it done...

        Michael

Re: need to parse firts part of SQL-query (regex question)
by roboticus (Chancellor) on Jan 17, 2008 at 14:46 UTC
    jeanluca:

    Parsing can be tricky. Regexes aren't always the right answer. The main problem is that there are varying quoting rules and matching delimiter pairs that interact with each other. Coming up with a regex to handle it properly would be tricky.

    Suppose you answered your immediate question. Then you'd encounter an expression like: f1, f2, substr('bizarro)',5,2)

    You might lookover the Parse::RecDescent module to see if you can stand to use that.

    ...roboticus

    Update: Regexes can be great for breaking the input stream into tokens, so perhaps you'll find it sufficient to tokenize your input, and then put some of the tokens back together to build the data statement you want. Once you tokenize your strings, then you can look for mismatched things like parens, and glue some tokens back together to get the results you want. (Poor man's parsing?) This approach probably has its own problems though...

Re: need to parse firts part of SQL-query (regex question)
by almut (Canon) on Jan 17, 2008 at 16:32 UTC

    Here's a (somewhat ugly) attempt using Text::Balanced. The idea is to extract the balanced parentheses fragments, in order to apply the split to the remaining parts of the string only.

    use strict; use warnings; use Text::Balanced qw(extract_bracketed); sub mysplit { my $text = shift; my @fields; do { my ($paren, $post, $pre) = extract_bracketed($text, '()', '[^( +]*'); my $s = ''; if ($pre) { $s = $pre; } elsif (!$paren) { $s = $post; $post = ''; } $s =~ s/^,//; # get rid of superfluous leading comm +a my @f = split /,/, $s; $f[-1] .= $paren if @f; # append balanced parens part to last + elem push @fields, @f; $text = $post; } while ($text); return @fields; } my $sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7"; print "$_\n" for mysplit($sql);

    Output:

    f1 f2 SUM(f3) CONCAT(f4,f5, f6) f7

    (Not well tested — also, I have that feeling there must be something more elegant than this mess of conditionals... but it's escaping me right now :)

Re: need to parse firts part of SQL-query (regex question)
by johngg (Canon) on Jan 17, 2008 at 17:13 UTC
    You can attack this from both ends of the string using split with a third argument limiting the number of resultant fields and reverse.

    #!/usr/local/bin/perl -l # use strict; use warnings; my $sql = q{f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7}; my @fields = split m{\s*,\s*}, $sql, 4; splice @fields, 3, 1, reverse map { $_ = reverse } split m{\s*,\s*}, reverse($fields[3]), 2; print for @fields;

    Here's the output.

    f1 f2 SUM(f3) CONCAT(f4,f5, f6) f7

    The use of spaces with commas was inconsistent so I decided to discard the spaces during the split.

    Cheers,

    JohnGG

Re: need to parse firts part of SQL-query (regex question)
by martin (Friar) on Jan 17, 2008 at 18:22 UTC
    How about this: Think positive and use a regex for the fields you want rather than what might or might not separate them, like this:

    my $part_sql = "f1,f2, SUM(f3),CONCAT(f4,f5, f6), f7"; my @fields = $part_sql =~ m{ (?: ^ | , ) # start of string or comma ( # begin capture (?: # either: [^,()] # non-paren non-comma | # or: \( # left paren [^()]* # any amount of non-paren \) # right paren )* # as often as possible ) # end capture }gx; # take all matches print map { "$_\n" } @fields;

    Of course, this simplistic approach does not handle nested parentheses.

      Quite an interesting solution. Of course, as you have stated, it does not handle nested parentheses. Which, of course, is something that I need to do.

      Any takers on how this can be adapted to handle nested parens?

Re: need to parse firts part of SQL-query (regex question)
by bart (Canon) on Jan 17, 2008 at 22:33 UTC
Re: need to parse firts part of SQL-query (regex question)
by KurtSchwind (Chaplain) on Jan 17, 2008 at 16:28 UTC

    This is a tricky bit of code. And you can use the tokenizer modules for SQL.

    However, another way is to take this in 2 steps. Using a temp variable for your sql, you could actually re-write the comas inside the concat to something else. Split on comas. And then take the part with the concat and return the values to comas again. For example, if you know you never have '#' characters in your sql, you could temporarily replace the comas in your concat to #s.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.