in reply to Re: Runtime Regexp Generation
in thread Runtime Regexp Generation

the data is generated on the fly, it's a packet capture of traffic on a network segment at any given moment. The data isn't already there... I'm collecting, crunching, and producing output from the data all in one go.

Replies are listed 'Best First'.
Re: I agree, but...
by BrowserUk (Patriarch) on Apr 14, 2003 at 17:40 UTC

    The major pain with trying to select records using regexes is that you have to match the whole record instead of just the fields you are selecting on; hence your difficulty with specifying the logical select "anything except this". The second problem is that your regex can match data in a different part of the record than the field you are actually interested in.

    By imposing some structure on your data--ie. making the fields in the record fixed length--and matching or rejecting on a field-by-field basis rather than trying to match (or not) a whole record at a time, you greatly simplify the process. This is what you would get by moving your data into a flat file DB and using DBI to perform your queries.

    At the very least, you should consider fixing the length of the fields of your records. You could then use substr to extract each field and match it against its own regex, which greatly simplifies the queries. E.g.

    if (   substr($record,  0, 10) =~ $src_ip_of_interest
       and substr($record, 10, 10) =~ $dst_ip_of_interest
       and substr($record, 20,  4) =~ $proto_of_interest
       and substr($record, 24,  6) !~ $src_port_of_disinterest
       # etc ...
    ) {
        # we found a record that matches the query
    }

    I think that you can see how much this simplifies the regexes involved. Generating conditionals using this form and using eval to execute them would be much simpler than trying to come up with a generic regex generator.
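    As a sketch of that approach (the offsets, patterns, and variable names here are made up for illustration): build the conditional as a string from the query terms, then eval it once into a sub that can be run against every record.

```perl
use strict;
use warnings;

# Hypothetical query terms: [ offset, length, pattern, negate? ]
my @terms = (
    [  0, 10, '^10\.0\.0\.5\b', 0 ],   # src IP must match
    [ 24,  6, '^80\b',          1 ],   # src port must NOT match
);

# Build the conditional source once ...
my $cond = join "\n    and ", map {
    my ($off, $len, $pat, $neg) = @$_;
    sprintf 'substr($_[0], %d, %d) %s /%s/',
        $off, $len, ($neg ? '!~' : '=~'), $pat;
} @terms;

# ... then compile it into a sub with eval, paying the compile cost once.
my $matches = eval "sub { $cond }" or die $@;

# Fixed-length record: 10-char src IP, 10-char dst IP, 4-char proto, 6-char port.
my $record = sprintf '%-10s%-10s%-4s%-6s', '10.0.0.5', '10.0.0.9', 'tcp', '1234';
print "matched\n" if $matches->($record);
```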

    That said, using BerkeleyDB or similar in conjunction with DBI::* would be considerably easier to code and probably much faster.


    Examine what is said, not who speaks.
    1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
    2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
    3) Any sufficiently advanced technology is indistinguishable from magic.
    Arthur C. Clarke.

      It appears to me that there is no need to make the fields fixed-length: the fields appear never to contain whitespace and are always separated by whitespace, so it is not hard to build a regex that matches exactly as desired.
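      A sketch of that idea (field order and patterns invented for illustration): because whitespace delimits the fields and never occurs inside them, a record-matching regex can be assembled field by field, with \S+ for fields you don't care about and a negative lookahead for "anything except this".

```perl
use strict;
use warnings;

# Hypothetical per-field patterns, in record order.
my @want = (
    '10\.0\.0\.5',      # src IP: must be this host
    '\S+',              # dst IP: anything
    'tcp',              # protocol
    '(?!80\b)\S+',      # src port: anything EXCEPT 80
);

# Whitespace separates the fields, so join the pieces on \s+.
my $pat   = join '\s+', @want;
my $query = qr/^\s*$pat/;

for my $record ('10.0.0.5 10.0.0.9 tcp 1234',
                '10.0.0.5 10.0.0.9 tcp 80') {
    print "$record => ", ($record =~ $query ? 'match' : 'no match'), "\n";
}
```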

                      - tye
        I wrote a generic log-reader combined with a custom query language (it looked like SQL, and I based it on Parse::RecDescent) several years ago. It supported reading fields from comma- (or anything else) separated records, regex-separated records, or a custom regex. Once an object was defined, each read would return an array of the wanted fields from the next line, which were fed to a precompiled query for evaluation.

        I don't have the code handy, but this solution can be abstracted very nicely. One gotcha is that it was quite slow, especially when regular expressions were used to extract values from each line (I'm talking hundreds of megabytes of logs).
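        Since the original code isn't to hand, here is a rough sketch of the shape described, with all names invented: the "precompiled query" becomes a closure, and each read splits the record on the separator and hands just the wanted fields to it.

```perl
use strict;
use warnings;

my $sep    = qr/,\s*/;    # comma-separated records
my @wanted = (0, 2);      # indices of the fields the query needs

# Stand-in for the output of the SQL-like parser: a precompiled
# query sub that receives only the wanted fields.
my $query = sub {
    my ($src, $proto) = @_;
    return $src eq '10.0.0.5' && $proto eq 'tcp';
};

my @log = (
    '10.0.0.5, 10.0.0.9, tcp, 1234',
    '10.0.0.7, 10.0.0.9, udp, 53',
);

my @hits = grep {
    my @fields = split $sep, $_;
    $query->( @fields[@wanted] );
} @log;

print "$_\n" for @hits;
```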