http://qs321.pair.com?node_id=728730


in reply to Re: In search of an efficient query abstractor
in thread In search of an efficient query abstractor

As I understand it, that would still do backtracking, right? Try to match the first one; if it fails, backtrack and try to match the alternative.

Another challenge here is that the initial characters are optional. Maybe if I write things out fully it becomes easier to optimize. I'll meditate on that. In that case, it might look like

my $number = qr{ \b\d+ | \b\d+\.\d+\ | \b\.\d+ .... }x;

But it already looks like I'm introducing backtracking again. My gut feeling is that I can write a state machine for this that doesn't need to do backtracking. Hmmm :-\

Replies are listed 'Best First'.
Re^3: In search of an efficient query abstractor
by Corion (Patriarch) on Dec 07, 2008 at 15:48 UTC

    Of course you can just do what lex does, but if you use a good character class as the first character and thus eliminate backtracking back over the first character (for example, eliminate the optional part of [-+]?\d+ by making it into [-+\d]\d*) you can use the atomic match operator ?> to eliminate lots of backtracking, as you know that your SQL is well-formed, or rather, it's of little concern to you if the SQL is not well-formed.

    Ideally, you have a disjunct set of character classes that start the separate matches. Likely, the disjunct set would be [-+\d] for numbers and ' for strings. If you want to be more careful, you can treat 0[bx]\w+ a bit more discerning, but I wouldn't bother and simply assume instead that the SQL is well-formed.

Re^3: In search of an efficient query abstractor
by gone2015 (Deacon) on Dec 07, 2008 at 20:44 UTC

    Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be ?

    Starting by tidying up:

    $query =~ s/\s+/ /g ; # that's the whitespace $query =~ s/\A\s// ; # strip leading $query =~ s/\s\Z// ; # strip trailing $query = lc($query) ; # all lower case $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ; # Eliminate separators from quoted string +s sub mash_s { my ($q, $s) = @_ ; $s =~ tr/0-9a-z/\\/c ; return $q.$s.$q ; } ;
    which, in particular, leaves all "..." or '...' strings containing only [0-9a-z\\]. Means that can then attack anything between separator characters:
    $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ; sub mash_l { my ($s) = @_ ; return $s if $s =~ /^(?:[a-z]+|\+|\-)$/ ; return 'N' if $s =~ /^[+-]?(?: (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+ +)? |(?:0(?: x[0-9a-f]+ |b[01]+ ) ) |x'[0-9a-f]+' |b'[01]+' )$/x ; return 'S' if $s =~ /^(["']).*?\1$/ ; return $s ; } ;
    Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...

    ...using a parser, where somebody else has done all the hard work, looks like a good trick !

      About unary/binary: I had the same thought while sketching out a state machine. Obviously you have to keep some context to know which is which. I'm thinking that brute-forcing and just treating such an expression as a number is acceptable for this log analysis. I mean,

      select 5 + 1; select 6; select 8 + 1+-5;

      From the point of view of log analysis, those statements are all similar. Selecting a number is selecting a number, mush them all together and report on them in aggregate.

      Of course that's not strictly true. You might have a silly application that constantly does "select 5" and not so frequently does "select 5 + 5" and you want to be able to distinguish them so you can find the offending code that's causing the first query. But that's a corner case.