Re^2: In search of an efficient query abstractor

in reply to Re: In search of an efficient query abstractor
in thread In search of an efficient query abstractor

As I understand it, that would still do backtracking, right? Try to match the first one; if it fails, backtrack and try to match the alternative.

Another challenge here is that the initial characters are optional. Maybe if I write things out fully it becomes easier to optimize. I'll meditate on that. In that case, it might look like

my $number = qr{
   \b\d+
   | \b\d+\.\d+\
   | \b\.\d+
   ....
}x;
[download]

But it already looks like I'm introducing backtracking again. My gut feeling is that I can write a state machine for this that doesn't need to do backtracking. Hmmm :-\

Comment on Re^2: In search of an efficient query abstractor Download Code

Replies are listed 'Best First'.
Re^3: In search of an efficient query abstractor by Corion (Patriarch) on Dec 07, 2008 at 15:48 UTC
Of course you can just do what `lex` does, but if you use a good character class as the first character and thus eliminate backtracking back over the first character (for example, eliminate the optional part of `[-+]?\d+` by making it into `[-+\d]\d*`) you can use the atomic match operator `?>` to eliminate lots of backtracking, as you know that your SQL is well-formed, or rather, it's of little concern to you if the SQL is not well-formed. Ideally, you have a disjunct set of character classes that start the separate matches. Likely, the disjunct set would be `[-+\d]` for numbers and `'` for strings. If you want to be more careful, you can treat `0[bx]\w+` a bit more discerning, but I wouldn't bother and simply assume instead that the SQL is well-formed.	[reply] [d/l] [select]
Re^3: In search of an efficient query abstractor by gone2015 (Deacon) on Dec 07, 2008 at 20:44 UTC
Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be ? Starting by tidying up: `$query =~ s/\s+/ /g ; # that's the whitespace $query =~ s/\A\s// ; # strip leading $query =~ s/\s\Z// ; # strip trailing $query = lc($query) ; # all lower case $query =~ s/(["'])((?:\\\1\|\1\1\|.)?)\1/mash_s($1, $2)/eg ; # Eliminate separators from quoted string +s sub mash_s { my ($q, $s) = @_ ; $s =~ tr/0-9a-z/\\/c ; return $q.$s.$q ; } ;` [download] which, in particular, leaves all `"..."` or `'...'` strings containing only `[0-9a-z\\]`. Means that can then attack anything between separator characters: `$query =~ s/([^ !#\$%(),\/:;<=>?\@[\]^{\|}~]+)/mash_l($1)/eg ; sub mash_l { my ($s) = @_ ; return $s if $s =~ /^(?:[a-z]+\|\+\|\-)$/ ; return 'N' if $s =~ /^[+-]?(?: (?:\d+(?:\.\d)? \| \.\d+) (?:e[+-]\d+ +)? \|(?:0(?: x[0-9a-f]+ \|b[01]+ ) ) \|x'[0-9a-f]+' \|b'[01]+' )$/x ; return 'S' if $s =~ /^(["']).?\1$/ ; return $s ; } ;` [download] Sadly, what this shows most clearly is that distinguishing unary and binary '`+`' and '`-`' is tricky. The above will cope with `12 + -17` and `12*-5`, but will fail on `12+13` or `12 +-13` and so on... ...using a parser, where somebody else has done all the hard work, looks like a good trick !	[reply] [d/l] [select]
Re^4: In search of an efficient query abstractor by xaprb (Scribe) on Dec 07, 2008 at 21:07 UTC
About unary/binary: I had the same thought while sketching out a state machine. Obviously you have to keep some context to know which is which. I'm thinking that brute-forcing and just treating such an expression as a number is acceptable for this log analysis. I mean, `select 5 + 1; select 6; select 8 + 1+-5;` [download] From the point of view of log analysis, those statements are all similar. Selecting a number is selecting a number, mush them all together and report on them in aggregate. Of course that's not strictly true. You might have a silly application that constantly does "select 5" and not so frequently does "select 5 + 5" and you want to be able to distinguish them so you can find the offending code that's causing the first query. But that's a corner case.	[reply] [d/l]

In Section Seekers of Perl Wisdom