I'm building a log analysis tool for MySQL logs. (mk-log-parser in Maatkit, http://www.maatkit.org). One of the more important functions is to take a SQL query from the log and "abstract" it into a "fingerprint" of the query. You can think of it as deriving the query's prototype.

Here's an example:

SELECT * from foo where bar > 1923

This query should be lowercased, whitespace collapsed, and parameters removed and replaced by placeholders, thusly:

select * from foo where bar > N

This becomes quite a task to do with regexes. Witness the code in QueryRewriter.pm in the fingerprint() subroutine:

sub fingerprint { my ( $self, $query, $opts ) = @_; $opts ||= {}; $query = lc $query; $query =~ s{ (?<![\w.+-]) [+-]? (?: \d+ (?:[.]\d*)? |[.]\d+ ) (?:e[+-]?\d+)? \b } {N}gx; # Float/real into N $query =~ s/\b0(?:x[0-9a-f]+|b[01]+)\b/N/g; # Hex/bin into N $query =~ s/[xb]'N'/N/g; # Hex/bin into N $query =~ s/\\["']//g; # Turn quoted strings +into S $query =~ s/(["']).*?\1/S/g; # Turn quoted strings +into S $query =~ s/\A\s+//; # Chop off leading whi +tespace $query =~ s/\s{2,}/ /g; # Collapse all whitesp +ace $query =~ s/[\n\r\f]+/ /g; # Collapse newlines et +c $query =~ s/\Ause \S+\Z/use I/; # Abstract the DB in U +SE $query =~ s{ \b(in|values?)\s*\(\s*([NS])\s*,[^\)]*\) } {$1($2+)}gx; # Collapse IN() and VALUES() lists # Table names that end with one or two groups of digits $query =~ s/(?<=\w_)\d+(_\d+)?\b/$1 ? "N_N" : "N"/eg; if ( $opts->{prefixes} ) { # or begin with them... $query =~ s/\b\d+(_\d+)?(?=[a-zA-Z_])/$1 ? "N_N" : "N"/eg; } return $query; }

The biggest problem is, it's not very efficient. The regexes do a lot of backtracking and stuff. I have profiled the resulting program and found that the regex that converts float/real numbers into "N" is particularly heinously slow.

I'd like to do this with a state machine, character-by-character, for a one-pass solution. I could do this in C. But that's what regexes are for, right? State machines that run char-by-char in C.

I need to make this as fast as possible, even at the expense of memory. This is really critical for the log analysis. What thoughts do you have on this problem?


In reply to In search of an efficient query abstractor by xaprb

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.