http://qs321.pair.com?node_id=728718

xaprb has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a log analysis tool for MySQL logs. (mk-log-parser in Maatkit, http://www.maatkit.org). One of the more important functions is to take a SQL query from the log and "abstract" it into a "fingerprint" of the query. You can think of it as deriving the query's prototype.

Here's an example:

SELECT * from foo where bar > 1923

This query should be lowercased, whitespace collapsed, and parameters removed and replaced by placeholders, thusly:

select * from foo where bar > N

This becomes quite a task to do with regexes. Witness the code in QueryRewriter.pm in the fingerprint() subroutine:

sub fingerprint {
   my ( $self, $query, $opts ) = @_;
   $opts ||= {};
   $query = lc $query;
   $query =~ s{
      (?<![\w.+-])
      [+-]?
      (?:
         \d+ (?:[.]\d*)?
         |[.]\d+
      )
      (?:e[+-]?\d+)?
      \b
   }{N}gx;                                       # Float/real into N
   $query =~ s/\b0(?:x[0-9a-f]+|b[01]+)\b/N/g;   # Hex/bin into N
   $query =~ s/[xb]'N'/N/g;                      # Hex/bin into N
   $query =~ s/\\["']//g;                        # Turn quoted strings into S
   $query =~ s/(["']).*?\1/S/g;                  # Turn quoted strings into S
   $query =~ s/\A\s+//;                          # Chop off leading whitespace
   $query =~ s/\s{2,}/ /g;                       # Collapse all whitespace
   $query =~ s/[\n\r\f]+/ /g;                    # Collapse newlines etc
   $query =~ s/\Ause \S+\Z/use I/;               # Abstract the DB in USE
   $query =~ s{
      \b(in|values?)\s*\(\s*([NS])\s*,[^\)]*\)
   }{$1($2+)}gx;                                 # Collapse IN() and VALUES() lists
   # Table names that end with one or two groups of digits
   $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;
   if ( $opts->{prefixes} ) {                    # or begin with them...
      $query =~ s/\b\d+(_\d+)?(?=[a-zA-Z_])/$1 ? "N_N" : "N"/eg;
   }
   return $query;
}

The biggest problem is, it's not very efficient. The regexes do a lot of backtracking and stuff. I have profiled the resulting program and found that the regex that converts float/real numbers into "N" is particularly heinously slow.

I'd like to do this with a state machine, character-by-character, for a one-pass solution. I could do this in C. But that's what regexes are for, right? State machines that run char-by-char in C.
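In Perl, the closest thing I can see to that is the \G anchor with the /gc flags. A rough sketch of the kind of one-pass scanner I mean (greatly simplified: no hex/bin literals, no sign handling, none of the IN()/VALUES() collapsing):

sub fingerprint_onepass {
   my ($query) = @_;
   $query = lc $query;
   my $out = '';
   pos($query) = 0;
   while ( pos($query) < length $query ) {
      if    ( $query =~ /\G\s+/gc                         ) { $out .= ' ' }  # whitespace run -> one space
      elsif ( $query =~ /\G(['"])(?:\\.|(?!\1).)*\1/gcs   ) { $out .= 'S' }  # quoted string -> S
      elsif ( $query =~ /\G([a-z_]\w*)/gc                 ) { $out .= $1  }  # keyword/identifier, kept as-is
      elsif ( $query =~ /\G\d+(?:\.\d*)?(?:e[+-]?\d+)?/gc ) { $out .= 'N' }  # number -> N
      else  { $query =~ /\G(.)/gcs; $out .= $1 }                             # anything else, copied through
   }
   return $out;
}

Each branch is anchored at the current position, and a failed branch never retries (/c keeps pos() where it was), so it behaves like a hand-rolled tokenizer.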

I need to make this as fast as possible, even at the expense of memory. This is really critical for the log analysis. What thoughts do you have on this problem?

Replies are listed 'Best First'.
Re: In search of an efficient query abstractor
by Corion (Patriarch) on Dec 07, 2008 at 15:04 UTC

    I think part of the inefficiency is that you're running multiple regular expressions over your query. You could roll all these regular expressions into one; at least, that's easy as long as the first character of any candidate match is different between all the regular expressions:

    my $re_num = qr{
       [+-]?
       (?:
          \d+ (?:[.]\d*)?
          |[.]\d+
       )
       (?:e[+-]?\d+)?
       \b
    }x;                                              # Float/real into N
    my $re_hexbin = qr{\b0(?:x[0-9a-f]+|b[01]+)\b};  # Hex/bin into N
    my $num_combined = qr{ $re_num | $re_hexbin }x;  # needs /x, or the spaces are literal
    $query =~ s/$num_combined/N/g;

    If you want to replace both strings and numbers in one go, you'll have to combine the regular expressions as above, but capture numbers into (say) $1 and strings into $2, and then replace with N or S depending on whether $1 or $2 is defined.
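    Untested sketch of what I mean: numbers are captured into $1, quoted strings into $2 (with $3 holding the quote character), and the replacement picks the placeholder accordingly:

    $query =~ s{
         ( (?<![\w.+-]) [+-]? (?: \d+ (?:[.]\d*)? | [.]\d+ ) (?:e[+-]?\d+)? \b )  # $1 = number
       | ( (["']) .*? \3 )                                                        # $2 = quoted string
    }{ defined $1 ? 'N' : 'S' }egsx;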

      As I understand it, that would still do backtracking, right? Try to match the first one; if it fails, backtrack and try to match the alternative.

      Another challenge here is that the initial characters are optional. Maybe if I write things out fully it becomes easier to optimize. I'll meditate on that. In that case, it might look like

      my $number = qr{ \b\d+ | \b\d+\.\d+ | \b\.\d+ .... }x;

      But it already looks like I'm introducing backtracking again. My gut feeling is that I can write a state machine for this that doesn't need to do backtracking. Hmmm :-\

        Of course you can just do what lex does. But if you use a good character class as the first character, and thus eliminate backtracking over the first character (for example, eliminate the optional part of [-+]?\d+ by making it into [-+\d]\d*), you can then use an atomic group, (?>...), to eliminate lots of backtracking, since you know that your SQL is well-formed -- or rather, since it's of little concern to you if the SQL is not well-formed.

        Ideally, you have a disjunct set of character classes that start the separate matches. Likely, the disjunct set would be [-+\d] for numbers and ' for strings. If you want to be more careful, you can treat 0[bx]\w+ a bit more discerningly, but I wouldn't bother and would simply assume instead that the SQL is well-formed.
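        Untested sketch of the atomic-group version. Note the trade-off: once (?>...) has committed to a number, a failing \b makes the whole match fail fast instead of being trimmed back, so input like "12345. " is left alone rather than half-matched:

        $query =~ s{
           (?<![\w.+-])
           (?> [+-]? (?: \d+ (?:[.]\d*)? | [.]\d+ ) (?:e[+-]?\d+)? )
           \b
        }{N}gx;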

        Can you identify token separators, and break the input up into stuff which isn't a problem, and stuff which might be?

        Starting by tidying up:

        $query =~ s/\s+/ /g;       # that's the whitespace
        $query =~ s/\A\s//;        # strip leading
        $query =~ s/\s\Z//;        # strip trailing
        $query = lc($query);       # all lower case
        $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg;
                                   # Eliminate separators from quoted strings

        sub mash_s {
           my ($q, $s) = @_;
           $s =~ tr/0-9a-z/\\/c;
           return $q.$s.$q;
        };
        which, in particular, leaves all "..." or '...' strings containing only [0-9a-z\\]. That means we can then attack anything between separator characters:
        $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg;

        sub mash_l {
           my ($s) = @_;
           return $s  if $s =~ /^(?:[a-z]+|\+|\-)$/;
           return 'N' if $s =~ /^[+-]?(?: (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+)?
                                        | (?:0(?: x[0-9a-f]+ | b[01]+ ))
                                        | x'[0-9a-f]+'
                                        | b'[01]+'
                                       )$/x;
           return 'S' if $s =~ /^(["']).*?\1$/;
           return $s;
        };
        Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...

        ...using a parser, where somebody else has done all the hard work, looks like a good trick !

Re: In search of an efficient query abstractor
by CountZero (Bishop) on Dec 07, 2008 at 17:38 UTC
    Did you try SQL::Statement? It includes an SQL-parser:
    use SQL::Statement;
    use Data::Dumper;

    my $sql = "SELECT * from foo where bar > 1923";

    my $parser = SQL::Parser->new();
    my $stmt   = SQL::Statement->new($sql, $parser);

    printf "Command %s\n", $stmt->command;
    printf "Columns %s\n", join ',', map { $_->name } $stmt->columns;
    printf "Tables  %s\n", join ',', map { $_->name } $stmt->tables;
    print  "Where operator\n", Dumper($stmt->where);
    Result:
    Command SELECT
    Columns *
    Tables  FOO
    Where operator
    $VAR1 = bless( {
                     'arg2' => '1923',
                     'arg1' => bless( {
                                        'table' => 'FOO',
                                        'name'  => 'BAR'
                                      }, 'SQL::Statement::Column' ),
                     'neg' => 0,
                     'op'  => '>'
                   }, 'SQL::Statement::Op' );
    Working with the parse tree seems much easier and less error-prone than using your regex solution.

    The $stmt->where structure is a tree structure with the arg leaves either being a string or an object. If a leaf is not an object, then it is trivially easy to run it through Regexp::Common with Regexp::Common::number to check its "numberness" and what kind of number it is (decimal, octal, hex, ...)
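    Untested sketch of that check on a few made-up leaf values:

    use Regexp::Common qw(number);

    for my $leaf ( '1923', '1a3f', 'bar' ) {
       if    ( $leaf =~ /^$RE{num}{real}$/ ) { print "$leaf: decimal/real\n" }
       elsif ( $leaf =~ /^$RE{num}{hex}$/  ) { print "$leaf: hex\n" }
       else                                  { print "$leaf: not a number\n" }
    }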

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I should explain a little extra subtlety here. Suppose I have an SQL statement of the form

      insert into foo (col1, col2, col3) values(1, 2, 3), (4, 5, 6), ......

      Now assume there are a million sets of values there. I want them all mushed down to a single "values(N, N, N)".

      This happens to me all the time in the real world. How fast is this going to be?

      And then, when the parser is done with it, I still have to reassemble the thing into a fingerprint of the query... which will be a lot of special-case code. MySQL's grammar is nontrivial.

      So I don't think this will be a fast or easy way to do what I want, although I appreciate the idea.

        SQL::Statement also understands INSERT statements, but unfortunately it will choke on multiple sets of VALUES. So you would have to collapse these into one beforehand anyhow, which of course sort of defeats the use of this module ...
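        The collapsing itself is at least cheap to do up front. Untested sketch (it assumes no ')' ever appears inside a quoted value, which a real version would have to handle):

        $sql =~ s{ ( \b values? \s* \( [^)]* \) )   # keep the first (...) group
                   (?: \s* , \s* \( [^)]* \) )+     # drop all the following ones
                 }{$1}gix;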

        Maybe you can petition the module's author jZED to extend it to include this case?

        If you want to parse the SQL statements yourself, it can do no harm to look at the code of DBI::SQL::Nano, which has a nice (but limited) set of regexes that parse them. It shouldn't be too difficult to extend these, although to parse the full MySQL grammar you will probably need something like Parse::RecDescent or Parse::Yapp.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: In search of an efficient query abstractor
by ikegami (Patriarch) on Dec 07, 2008 at 17:59 UTC

    The regexes do a lot of backtracking and stuff.

    Which one? I don't see anywhere that backtracking could occur. Everywhere there's a choice, only one path can possibly match.

    $query =~ s{
       (?<![\w.+-])
       [+-]?
       (?:
          \d+ (?:[.]\d*)?
          |[.]\d+
       )
       (?:e[+-]?\d+)?
       \b
    }{N}gx;    # Float/real into N

    The trailing \b is buggy.

    "12345 " -> "N " "12345. " -> "N. "

    It matched "12345." until the "\b" was reached, then it backtracked to release the "." from the match. Oops, I guess there is backtracking, but only because of the bug.

      Hmmm. Good catch, although I thought I had a test case to ensure that was handled correctly. I'll check that :) You seem to know more about regexes than I do. Why would you guess this particular regex is slow? Line-by-line profiling proves that it is consuming the vast majority of the time. It consumes almost 300 CPU seconds, and the next most expensive line in this code consumes 89 seconds, on an 8GB file.

      The next-most-worst offender is

      $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg;

      followed by

      $query =~ s/\s{2,}/ /g;

        Are you doing the whole 8GB file at once? If your string starts with "a  b  c  d", $query =~ s/\s{2,}/ /g; needs to copy 32GB of text. Just for the first 10 characters.
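        If so, reading and fingerprinting one event at a time avoids those huge copies. Sketch only -- $log_file, $qr, %count and the record separator are made up, and the real slow log format needs a proper event reader:

        open my $fh, '<', $log_file or die "Can't open $log_file: $!";
        local $/ = ";\n";                      # assumed end-of-statement marker
        while ( defined( my $event = <$fh> ) ) {
           my $fingerprint = $qr->fingerprint($event);   # $qr: your QueryRewriter object
           $count{$fingerprint}++;
        }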

        At least for the last case, Perl has an optimized version:

        $query =~ tr[ ][]s;

        and it should be faster than, or at least as fast as, the s/// version. Another version to try would be s/\s+/ /g: there is no need for the counting variant {2,}, and skipping over the single spaces might be slower than simply writing out the replacement.
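        All three variants are easy to compare with the core Benchmark module. A sketch (the input string is made up, and the numbers will depend heavily on the shape of real queries):

        use Benchmark qw(cmpthese);

        my $input = "select   *  from    foo   where  bar > 1923   " x 1000;
        cmpthese( -2, {
           s_counting => sub { ( my $q = $input ) =~ s/\s{2,}/ /g },
           s_greedy   => sub { ( my $q = $input ) =~ s/\s+/ /g },
           tr_squeeze => sub { ( my $q = $input ) =~ tr/ //s },
        } );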

      So it occurs to me there's an OBVIOUS reason why this regex might be the most costly: as other commenters have mentioned, I'm doing this on a bunch of stuff that might be embedded inside of strings, and what have you. So even if I leave all the regexes as they are, I think I can just reorder them and make things potentially a lot faster -- even if I do still have multiple passes through the string. I'll take a look at a strategy of "first, throw away as much as I can, then work with what's left."

Re: In search of an efficient query abstractor
by eye (Chaplain) on Dec 07, 2008 at 18:50 UTC
    There are potentially other small improvements. You could drop:
    $query =~ s/[\n\r\f]+/ /g;
    since you already collapsed whitespace on the line before.

    You might get better performance out of:

    $query =~ s/(["']).*?\1/S/g;
    by replacing it with:
    $query =~ s/"[^"]*"/S/g; $query =~ s/'[^']*'/S/g;
    This avoids capturing characters, which may or may not help. Also, you could potentially handle double-quoted (") strings earlier in the sequence of regexen if that would reduce the number of potential real-number strings exposed to your regex for reals and floats.

    Finally, in:

    $query =~ s{ \b(in|values?)\s*\(\s*([NS])\s*,[^\)]*\) } {$1($2+)}gx;
    You can take advantage of the fact that you have already normalized whitespace:
    $query =~ s{ \b(in|values?) *\( *([NS]) *,[^\)]*\) } {$1($2+)}gx;

      Good points. It occurs to me that I could just point you to my profiling results in case you really have a lot of time on your hands. See my earlier reply -- the float/real into N regex is way more expensive than others.

      Replacing the single quoted-string regex with single- and double-quote specific ones is indeed more efficient.
Re: In search of an efficient query abstractor
by samtregar (Abbot) on Dec 07, 2008 at 18:53 UTC
    I think your idea of doing this with a state machine in C is a decent one. Have you tried it? How did it work? Are you planning to write it by hand or generate it from a grammar using something like yacc?

    Another thought - are you sure this has to be really fast? I've done a lot of log analysis work and it's very rare to need a lot of speed. Often it's fine if the job runs overnight as long as it produces the correct answers.

    Alternately, why not throw some hardware at the problem? Can you divide the log file up and run the analysis in parallel on many machines? You'll need to add a way to combine the results from each machine, but for log analysis that's often a simple step.

    -sam

      Thanks. There are two things scaring me away from doing it in C:

      • I try to keep these tools self-contained and platform neutral so you can wget them and just run them with no further ado. I have never built a Perl program that makes calls to a compiled C module, but I imagine it'd be hard to make it wget-able and build it into a single file. Although maybe the C library could be embedded into __DATA__ and written out to a temporary location at runtime, or something similarly evil ;)
      • I'm not exactly an expert in C programming. I mean I think I can translate my hand-drawn state machine, which isn't too complicated, into a C state machine pretty simply. But I am pretty sure something like yacc would be better and I'm scared of that.

      So, basically I'm scared. I want to stay in my comfortable safe Perl zone if I can.

      It definitely has to be fast and I can't divide it up, alas. I work with Percona; we are consultants for hire. We log onto people's critical production database servers and figure out why they are slow. One of the top-five tools we use is a log analysis tool. We need something that can crunch the file without requiring installation or whatever. We can't even copy the file elsewhere in most cases, we pretty much just have to operate in read-only mode.

      This code was originally written to be correct, not fast -- back in the good old days when I had the luxuries you mention (I was working on my own DB servers.) It handles a lot of special cases that the log-parsers-ad-nauseam out there don't. (Which is why I'm reinventing the wheel. No one has done this well yet.) Now I have to make it both correct, and fast.

        but I imagine it'd be hard to make it wget-able and build it into a single file.

        While I think solving this in pure Perl is probably do-able, if you do want to drop to C, you could check out Inline::C.

        This allows you to do exactly what you want: embed C in your perl file and have it All Just Work.

        (It does require a working C build environment where you run the script, and it will also write the .o/.so files generated by the compiler to a cache so that it only compiles as needed. This might create deployment issues.)
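        A toy sketch of the shape this takes (collapse_ws is a made-up stand-in for the real fingerprinting pass, and the fixed buffer is demo-only, not production-safe):

use Inline C => <<'END_C';
char* collapse_ws(char* in) {
    static char buf[65536];   /* demo only: fixed, non-reentrant buffer */
    char *out = buf;
    int in_ws = 0;
    while (*in && out < buf + sizeof(buf) - 1) {
        if (*in == ' '  || *in == '\t' || *in == '\n'
         || *in == '\r' || *in == '\f') {
            if (!in_ws) *out++ = ' ';   /* keep one space per run */
            in_ws = 1;
        } else {
            *out++ = *in;
            in_ws = 0;
        }
        in++;
    }
    *out = '\0';
    return buf;
}
END_C

print collapse_ws("select  *\n  from   foo"), "\n";   # prints "select * from foo"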

        There's no reason to be afraid of C. It's one of the easiest languages to learn (you already learned one of the hardest, Perl!) and there's tons of great code out there to study. There's also no need to worry overmuch about portability - C is probably the most widely supported programming language ever created. If a platform runs Perl I guarantee it can run a C compiler! And you don't even have to install a compiler to get your C code running, you can cross-compile for practically any target you can imagine.

        And just to echo what you've already been told - Inline::C is perfect for this project. It will make gluing your C code into your Perl program very easy. If you already knew XS I'd say not to bother since your call signature should be very simple here, but why bother if you don't have to?

        Finally, when you're done, why not put it on CPAN? I'm sure other people could use a good SQL statement sig generator. I'd be interested in adding it as an optional filter for DBI::Profile and DBI::ProfileDumper, for one.

        -sam

Re: In search of an efficient query abstractor
by talexb (Chancellor) on Dec 08, 2008 at 02:35 UTC

    If one of the existing Perl modules already mentioned doesn't do it for you, I'd highly recommend a state machine; SQL is (from what I know) well enough defined that it should be quite a fun little exercise. However, I wouldn't do it character by character -- that's insane -- I'd go word by word (or rather, token by token).

    And while I love C for its speed, I'd probably write it in Perl for speed of development -- at least until I started having performance problems anyways.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: In search of an efficient query abstractor
by Anonymous Monk on Dec 07, 2008 at 18:55 UTC
    Hmm. My first thought is that this sounds like a compiler question. Try the Parse::RecDescent module and build your own, if nothing else exists. Good luck.

      Thanks. Can you elaborate a little more on how you would approach it?

        The main issue you're running into is that SQL is a context-free language. That is, it belongs to a class of languages whose expressive power is greater than that of regular languages, and which thus cannot be parsed using regular expressions. (The "regular" in regular expressions refers to regular languages.) You'll need to use a recursive descent parser to properly handle SQL statements.

        The best place to begin is to find a grammar for the specific dialect of SQL you're using. You mention that you're parsing MySQL logs, so you may be in luck. The MySQL docs include a grammar for each SQL command it understands. For instance, on the doc page for select, the gray box in fixed width font is the grammar.

        I do not know the specifics of RecDescent (never used it), but I think you should be able to give it the grammar and a SQL statement and it will give you a parse tree. That greatly simplifies your job, because now you only have to recognize patterns in the tree. In fact, productions of the grammar using specific rules are probably the exact "prototypes" you're looking for.
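        I haven't run this, but from the docs the shape would be something like the following toy (the grammar fragment is mine, and nowhere near the real MySQL grammar):

        use Parse::RecDescent;

        my $grammar = q{
           statement : /select/i columns /from/i table where(?)
                       { $return = 'select ... from ' . $item{table} }
           columns   : /\*/ | ident(s /,/)
           table     : ident
           where     : /where/i ident /[<>=]+/ /\d+/
           ident     : /\w+/
        };

        my $parser = Parse::RecDescent->new($grammar);
        my $proto  = $parser->statement('select * from foo where bar > 1923');
        print defined $proto ? $proto : 'no parse', "\n";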

        A note on efficiency: a recursive descent parser isn't terribly efficient, at O(n^3) in the size of the input. If speed's the name of the game and you have a lot of time to spend on this, you can try building an LL(1) grammar for SQL, which can be parsed in O(n) using a custom parser. Not recommended for the faint of heart.

        Further note: I'm not the anonymous monk in the gp, just a CS student with a passion for theory.

Re: In search of an efficient query abstractor
by Anonymous Monk on Dec 07, 2008 at 21:49 UTC
    I recently wrote a syntax analyzer for a simplified Pascal-like language using GNU Flex and Bison. I think those are the tools you need to build your query abstractor; it could probably even be done using Flex alone.
Re: In search of an efficient query abstractor
by bduggan (Pilgrim) on Dec 10, 2008 at 14:58 UTC
    If you decided to allow C, you could even start from the mysql source itself and derive a fingerprinter from the actual parser (from sql/sql_yacc.yy, it looks like the parsing is done with yacc/bison).