Re^3: In search of an efficient query abstractor

Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be ?

Starting by tidying up:

  $query =~ s/\s+/ /g ;      # collapse each run of whitespace to one space
  $query =~ s/\A\s// ;       # strip the leading space, if any
  $query =~ s/\s\Z// ;       # strip the trailing space, if any

  $query = lc($query) ;      # all lower case

  $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ;
                             # Eliminate separators from quoted strings

  sub mash_s {
    my ($q, $s) = @_ ;
    $s =~ tr/0-9a-z/\\/c ;
    return $q.$s.$q ;
  } ;

which, in particular, leaves every "..." or '...' string containing only [0-9a-z\\]. That means we can then attack anything between separator characters:

  $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ;

  sub mash_l {
    my ($s) = @_ ;

    return $s if $s =~ /^(?:[a-z]+|\+|\-)$/ ;

    return 'N' if $s =~ /^[+-]?(?:
                                 (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+)?
                                |(?:0(?:
                                        x[0-9a-f]+
                                       |b[01]+
                                      )
                                 )
                                |x'[0-9a-f]+'
                                |b'[01]+'
                               )$/x ;

    return 'S' if $s =~ /^(["']).*?\1$/ ;

    return $s ;
  } ;

Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...

...using a parser, where somebody else has done all the hard work, looks like a good trick !
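For anyone who wants to poke at the behaviour described above, here is a minimal sketch that chains the steps into a single sub. The name abstract_query is made up for this illustration (it is not in the code as posted), and the sample queries are hypothetical. The rest is the code from above with the line-wrapped pieces joined.

```perl
use strict;
use warnings;

# Replace every character outside 0-9a-z inside a quoted string with a backslash.
sub mash_s {
    my ($q, $s) = @_;
    $s =~ tr/0-9a-z/\\/c;
    return $q . $s . $q;
}

# Classify one token found between separator characters.
sub mash_l {
    my ($s) = @_;
    return $s  if $s =~ /^(?:[a-z]+|\+|\-)$/;      # plain word or lone sign
    return 'N' if $s =~ /^[+-]?(?:
                            (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+)?
                           |0(?:x[0-9a-f]+|b[01]+)
                           |x'[0-9a-f]+'
                           |b'[01]+'
                          )$/x;                      # numeric literal
    return 'S' if $s =~ /^(["']).*?\1$/;            # mashed quoted string
    return $s;                                       # anything else is left alone
}

# Hypothetical wrapper: runs the steps in the order given in the post.
sub abstract_query {
    my ($query) = @_;
    $query =~ s/\s+/ /g;
    $query =~ s/\A\s//;
    $query =~ s/\s\Z//;
    $query = lc $query;
    $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg;
    $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg;
    return $query;
}

print abstract_query($_), "\n" for
    q{SELECT  name FROM users WHERE id = 42 AND email = 'a@b.com'},
    'select 12 + -17',     # copes:  select N + N
    'select 12*-5',        # copes:  select N*N
    'select 12+13',        # fails:  select 12+13
    'select 12 +-13';      # fails:  select N +-13
```

The first query comes out as "select name from users where id = N and email = S". The last two show the problem: '+' and '-' are not separators, so 12+13 and +-13 each reach mash_l as one token that is neither a word nor a number.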

Re^4: In search of an efficient query abstractor
by xaprb (Scribe) on Dec 07, 2008 at 21:07 UTC

About unary/binary: I had the same thought while sketching out a state machine. Obviously you have to keep some context to know which is which. I'm thinking that brute-forcing and just treating such an expression as a number is acceptable for this log analysis. I mean,

select 5 + 1;
select 6;
select 8 + 1+-5;

From the point of view of log analysis, those statements are all similar. Selecting a number is selecting a number: mush them all together and report on them in aggregate.

Of course that's not strictly true. You might have a silly application that constantly does "select 5" and not so frequently does "select 5 + 5" and you want to be able to distinguish them so you can find the offending code that's causing the first query. But that's a corner case.
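A rough sketch of that brute-force idea, in case it helps: collapse any run of numeric literals joined by + - * / into a single N. The sub name collapse_numeric_exprs and the $num pattern are made up for this illustration. It assumes the query has already been lower-cased and that quoted strings have already been replaced, since it would otherwise rewrite digits inside strings too.

```perl
use strict;
use warnings;

# Unsigned numeric literal such as 12, 1.5, .5, 1e5 or 1.5e-3 (lower-case input assumed).
my $num = qr/(?:\d+(?:\.\d*)?|\.\d+)(?:e[+-]?\d+)?/;

# Hypothetical helper: treat a whole arithmetic expression of numbers as one number.
sub collapse_numeric_exprs {
    my ($query) = @_;
    $query =~ s{
        (?<![0-9a-z_.])                    # not in the middle of an identifier or number
        [+-]? $num                         # first number, optionally signed
        (?: \s* [-+*/] \s* [+-]? $num )*   # then any number of operator and number pairs
        (?![0-9a-z_])                      # and not running into an identifier
    }{N}gx;
    return $query;
}

print collapse_numeric_exprs($_), "\n"
    for 'select 5 + 1;', 'select 6;', 'select 8 + 1+-5;';
```

All three statements come out as "select N;", which is exactly the trade-off described above: "select 5" and "select 5 + 5" can no longer be told apart.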
