http://qs321.pair.com?node_id=728718

xaprb has asked for the wisdom of the Perl Monks concerning the following question:

I'm building a log analysis tool for MySQL logs. (mk-log-parser in Maatkit, http://www.maatkit.org). One of the more important functions is to take a SQL query from the log and "abstract" it into a "fingerprint" of the query. You can think of it as deriving the query's prototype.

Here's an example:

SELECT * from foo where bar > 1923

This query should be lowercased, whitespace collapsed, and parameters removed and replaced by placeholders, thusly:

select * from foo where bar > N

This becomes quite a task to do with regexes. Witness the code in QueryRewriter.pm in the fingerprint() subroutine:

sub fingerprint { my ( $self, $query, $opts ) = @_; $opts ||= {}; $query = lc $query; $query =~ s{ (?<![\w.+-]) [+-]? (?: \d+ (?:[.]\d*)? |[.]\d+ ) (?:e[+-]?\d+)? \b } {N}gx; # Float/real into N $query =~ s/\b0(?:x[0-9a-f]+|b[01]+)\b/N/g; # Hex/bin into N $query =~ s/[xb]'N'/N/g; # Hex/bin into N $query =~ s/\\["']//g; # Turn quoted strings +into S $query =~ s/(["']).*?\1/S/g; # Turn quoted strings +into S $query =~ s/\A\s+//; # Chop off leading whi +tespace $query =~ s/\s{2,}/ /g; # Collapse all whitesp +ace $query =~ s/[\n\r\f]+/ /g; # Collapse newlines et +c $query =~ s/\Ause \S+\Z/use I/; # Abstract the DB in U +SE $query =~ s{ \b(in|values?)\s*\(\s*([NS])\s*,[^\)]*\) } {$1($2+)}gx; # Collapse IN() and VALUES() lists # Table names that end with one or two groups of digits $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg; if ( $opts->{prefixes} ) { # or begin with them... $query =~ s/\b\d+(_\d+)?(?=[a-zA-Z_])/$1 ? "N_N" : "N"/eg; } return $query; }

The biggest problem is, it's not very efficient. The regexes do a lot of backtracking and stuff. I have profiled the resulting program and found that the regex that converts float/real numbers into "N" is particularly heinously slow.

I'd like to do this with a state machine, character-by-character, for a one-pass solution. I could do this in C. But that's what regexes are for, right? State machines that run char-by-char in C.

I need to make this as fast as possible, even at the expense of memory. This is really critical for the log analysis. What thoughts do you have on this problem?