I'm building a log analysis tool for MySQL logs. (mk-log-parser in Maatkit, http://www.maatkit.org). One of the more important functions is to take a SQL query from the log and "abstract" it into a "fingerprint" of the query. You can think of it as deriving the query's prototype.
Here's an example:
SELECT * from foo where bar > 1923
This query should be lowercased, whitespace collapsed, and parameters removed and replaced by placeholders, thusly:
select * from foo where bar > NThis becomes quite a task to do with regexes. Witness the code in QueryRewriter.pm in the fingerprint() subroutine:
sub fingerprint { my ( $self, $query, $opts ) = @_; $opts ||= {}; $query = lc $query; $query =~ s{ (?<![\w.+-]) [+-]? (?: \d+ (?:[.]\d*)? |[.]\d+ ) (?:e[+-]?\d+)? \b } {N}gx; # Float/real into N $query =~ s/\b0(?:x[0-9a-f]+|b[01]+)\b/N/g; # Hex/bin into N $query =~ s/[xb]'N'/N/g; # Hex/bin into N $query =~ s/\\["']//g; # Turn quoted strings +into S $query =~ s/(["']).*?\1/S/g; # Turn quoted strings +into S $query =~ s/\A\s+//; # Chop off leading whi +tespace $query =~ s/\s{2,}/ /g; # Collapse all whitesp +ace $query =~ s/[\n\r\f]+/ /g; # Collapse newlines et +c $query =~ s/\Ause \S+\Z/use I/; # Abstract the DB in U +SE $query =~ s{ \b(in|values?)\s*\(\s*([NS])\s*,[^\)]*\) } {$1($2+)}gx; # Collapse IN() and VALUES() lists # Table names that end with one or two groups of digits $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg; if ( $opts->{prefixes} ) { # or begin with them... $query =~ s/\b\d+(_\d+)?(?=[a-zA-Z_])/$1 ? "N_N" : "N"/eg; } return $query; }
The biggest problem is, it's not very efficient. The regexes do a lot of backtracking and stuff. I have profiled the resulting program and found that the regex that converts float/real numbers into "N" is particularly heinously slow.
I'd like to do this with a state machine, character-by-character, for a one-pass solution. I could do this in C. But that's what regexes are for, right? State machines that run char-by-char in C.
I need to make this as fast as possible, even at the expense of memory. This is really critical for the log analysis. What thoughts do you have on this problem?
In reply to In search of an efficient query abstractor by xaprb
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |