Runtime Regexp Generation

tekkie has asked for the wisdom of the Perl Monks concerning the following question:

Warning in advance: this is a somewhat long write-up... readmore has been utilized.

I have lines of text that look similar to this:

1 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 80 [SYN]
2 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 113 [SYN]
3 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 123 [SYN]
4 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply
5 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) request
6 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply
7 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 562 [RST]
8 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 36 [RST]
9 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 90 [RST]

For interested parties: the data comes from Ethereal packet capture frames. I extract the data from individual packets and create a summary report resembling the above, which is then filtered through an algorithm that detects incoming attacks based on packet signatures and thresholds (read as: IDS).

For these lines, a base regexp can be applied:
/\d+\s+(\d+)\s+(SOURCE_ADDR)\s+->\s+(DEST_ADDR)\s+(PROT)\s+(SOURCE_PORT)\s+>\s+(DEST_PORT)/

I want to write a CGI script that substitutes user-supplied values for the capitalized pieces in the above regexp (or a pattern matching any non-whitespace if a value is omitted) and then returns the lines that match the generated regexp.

For example:

user input:
SOURCE_ADDR=1.2.3.4 PROT=TCP

would generate the regexp:
/\d+\s+(\d+)\s+(1.2.3.4)\s+->\s+(\S+)\s+(TCP)\s+(\S+)\s+>\s+(\S+)/

and would match lines 1-3,7-9 of the above data set.

However, I need a way to generate regexp's that will not match certain values, such as the following:

user input: PROT=!TCP
would match lines 4-6 above.

user input: PORT=!80
would match lines 2-9 above.

I'd like to be able to do this without using several if statements. The only negative regexp operators Perl offers are look-behind/look-ahead, whereas I need an operator that will "match anything not equal to."

Does such an operator exist? Is there some combination of look-ahead/look-behind that I could use to do what I want?

Here's the current code:

#!/usr/bin/perl -w

use strict;
use CGI;

use vars qw($data_file);
$data_file = 'data.txt';

{
    my $cgi = new CGI;
    my $custom_regexp = '\d+\s+(\d+)\s+(SOURCE_ADDR)\s+->\s+(DEST_ADDR)\s+(PROT)\s+(SOURCE_PORT)\s+>\s+(DEST_PORT)';
    my %user_param;

    $user_param{'dest_addr'}   = defined($cgi->param('dest_addr'))   ? $cgi->param('dest_addr')   : '\S+';
    $user_param{'source_addr'} = defined($cgi->param('source_addr')) ? $cgi->param('source_addr') : '\S+';
    $user_param{'prot'}        = defined($cgi->param('prot'))        ? $cgi->param('prot')        : '\S+';
    $user_param{'source_port'} = defined($cgi->param('source_port')) ? $cgi->param('source_port') : '\S+';
    $user_param{'dest_port'}   = defined($cgi->param('dest_port'))   ? $cgi->param('dest_port')   : '\S+';
        
    my $new_sig = $custom_regexp;
    
    foreach my $key (keys %user_param) { 
        if($user_param{$key} =~ /^!(.+?)$/) {
            $user_param{$key} = "?!$1)(\\S+)(?<!$1";
        }
    }
        
    $new_sig =~ s/SOURCE_ADDR/$user_param{'source_addr'}/;
    $new_sig =~ s/DEST_ADDR/$user_param{'dest_addr'}/;
    $new_sig =~ s/PROT/$user_param{'prot'}/;
    $new_sig =~ s/SOURCE_PORT/$user_param{'source_port'}/;
    $new_sig =~ s/DEST_PORT/$user_param{'dest_port'}/;
        
    print "$new_sig\n";
    
    open(DATA, "<$data_file");
        while(my $pkt = <DATA>) { print "$pkt" if $pkt =~ qr/$new_sig/; }
    close DATA;
}

But (?!$1)(\S+)(?<!$1) works only as long as the field I'm matching contains no whitespace; if it does, the \S+ doesn't match the whole field.

This may seem unimportant in the present application, but eventually, I'd like to be able to add a TCP_TYPE param, and those can resemble:

[SYN] or [SYN, ACK] (other values may be present besides SYN and ACK)

So using [(?!SYN)(\S+)(?<!SYN)] would fail on a [SYN, ACK] packet.

Any help would be greatly appreciated, thank you for taking the time to read all of this, and thank you in advance for any replies that help me along to my goal.


Replies are listed 'Best First'.
Re: Runtime Regexp Generation by perlguy (Deacon) on Apr 14, 2003 at 15:59 UTC
Combine them:

    while (chomp($_ = <TOPARSE>)) {
        next unless /\d+\s+(\d+)\s+(1\.2\.3\.4)\s+->\s+(4\.3\.2\.1)\s+(?!TCP)([^\s]+)\s+(3456)\s+>\s+(113)/;
        print;
    }

That way, you get the best of both worlds. You do a negative lookahead `(?!)` AND a protocol gulper (like `([^\s]+)`), and you can still search for all the other strings that you want to search for. This will match on a source of 1.2.3.4 and a destination of 4.3.2.1, but will make sure it is not a TCP packet.

Update: Per chromatic's and tye's recommendations, my proposed code would look like this:

    while (<TOPARSE>) {
        next unless /^\d+\s+(\d+)\s+(1\.2\.3\.4)\s+->\s+(4\.3\.2\.1)\s+(?!TCP\s)([^\s]+)\s+(3456)\s+>\s+(113)/;
        print;
    }
Re: Re: Runtime Regexp Generation by chromatic (Archbishop) on Apr 14, 2003 at 16:51 UTC
Careful! That construct will not process the last line of certain files. chomp can return a false value.
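To make the warning concrete, here is a minimal sketch comparing the two loop styles. The sample text and the sub names `count_with_chomp_test` and `count_safely` are made up for the illustration, and the input is assumed to lack a trailing newline on its last line. chomp returns the number of characters it removed, so it returns 0 (false) for that last line and the loop exits before processing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input: the final line has no trailing newline.
my $text = "line one\nline two\nlast line";

# Count the lines processed by the risky idiom.
sub count_with_chomp_test {
    my ($input) = @_;
    open(my $fh, '<', \$input) or die "cannot open in-memory file: $!";
    my $n = 0;
    local $_;
    no warnings 'uninitialized';    # at EOF the idiom ends up chomping undef
    while (chomp($_ = <$fh>)) { $n++ }
    close $fh;
    return $n;
}

# Count the lines processed when the read itself is the loop test.
sub count_safely {
    my ($input) = @_;
    open(my $fh, '<', \$input) or die "cannot open in-memory file: $!";
    my $n = 0;
    while (my $line = <$fh>) {      # Perl adds an implicit defined() test here
        chomp $line;
        $n++;
    }
    close $fh;
    return $n;
}

print "chomp as loop test: ", count_with_chomp_test($text), " lines\n";
print "read as loop test:  ", count_safely($text), " lines\n";
```

With this input the first loop processes two lines and the second processes all three.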
Re^2: Runtime Regexp Generation (anchor) by tye (Sage) on Apr 14, 2003 at 17:49 UTC
Now just add a ^ anchor at the front of the regex and you've got a nice solution for the regex (otherwise backtracking will be attempted, wasting time). Note that (?!TCP) only prevents matching against fields that start with "TCP". So you might want (?!TCP\s) instead (and don't chomp). - tye
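The difference between (?!TCP) and (?!TCP\s) can be seen in a small sketch. The protocol name TCPX below is invented purely for the demonstration, and the sub names are hypothetical; each sub returns the captured field, or undef when the line is rejected.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Rejects every field that merely starts with "TCP".
sub not_tcp_loose {
    my ($line) = @_;
    return $line =~ /^(?!TCP)(\S+)\s/ ? $1 : undef;
}

# Rejects only the field that is exactly "TCP" (followed by whitespace).
sub not_tcp_strict {
    my ($line) = @_;
    return $line =~ /^(?!TCP\s)(\S+)\s/ ? $1 : undef;
}

# "TCPX" is a made-up protocol name used only to show the prefix problem.
for my $line ("TCP 3456\n", "TCPX 3456\n", "UDP 53\n") {
    my $loose  = not_tcp_loose($line);
    my $strict = not_tcp_strict($line);
    chomp(my $shown = $line);
    printf "%-10s loose=%-8s strict=%s\n",
        $shown,
        defined $loose  ? $loose  : '(none)',
        defined $strict ? $strict : '(none)';
}
```

The trailing \s inside the lookahead is also why the line should not be chomped first when the negated field is the last one on the line.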
Re: Runtime Regexp Generation by hardburn (Abbot) on Apr 14, 2003 at 16:00 UTC
Variables are interpolated into a regex, so you don't need to do a s/THIS/THAT/ on your regex string. Also, it's safer if you anchor your regex to the beginning of the string. Here's an example:

    my $SOURCE_ADDR = param('source_addr') || '.*?';
    my $DEST_ADDR   = param('dest_addr')   || '.*?';
    my $PROTO       = param('proto')       || '\w+';
    my $SOURCE_PORT = param('source_port') || '\d+';
    my $DEST_PORT   = param('dest_port')   || '\d+';

    ...

    $entry =~ /\A
        \d+\s+
        (\d+)\s+
        ($SOURCE_ADDR)\s+
        ->\s+
        ($DEST_ADDR)\s+
        ($PROTO)\s+
        ($SOURCE_PORT)\s+
        >\s+
        ($DEST_PORT)
    /x;

Note that if a user doesn't input one of the above checks, it defaults to matching any valid input for that field. The IP addresses match against '.*?' because I'm too lazy to do it properly--I don't recommend doing it like that in a real program. Also, you might want to remove any whitespace in the user input. Users will likely put in '! TCP', which won't work the same as '!TCP', which is what you want.

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.* -- Schemer

Note: All code is untested, unless otherwise stated
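One caveat when interpolating user input: regex metacharacters in the value stay live unless they are quoted. The sketch below is only an illustration (the sub name `addr_pattern` and the second sample line are invented); it shows that an unquoted 1.2.3.4 also matches text where the dots are arbitrary characters, while quotemeta (or \Q...\E) restricts the match to the literal address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a pattern for the source-address field, optionally quoting the value.
sub addr_pattern {
    my ($addr, $quote) = @_;
    my $p = $quote ? quotemeta($addr) : $addr;
    return qr/^\d+\s+\d+\s+($p)\s/;
}

my $real  = "1 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 80 [SYN]\n";
my $bogus = "2 60 1x2y3z4 -> 4.3.2.1 TCP 3456 > 80 [SYN]\n";   # invented line

my $raw    = addr_pattern('1.2.3.4', 0);
my $quoted = addr_pattern('1.2.3.4', 1);

print "raw pattern matches bogus line\n"    if $bogus =~ $raw;
print "quoted pattern rejects bogus line\n" if $bogus !~ $quoted;
print "quoted pattern matches real line\n"  if $real  =~ $quoted;
```

Quoting also keeps a stray parenthesis or bracket in the input from producing an invalid pattern at run time.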
Re^2: Runtime Regexp Generation (! .*?) by tye (Sage) on Apr 14, 2003 at 18:05 UTC
Please use \S+ instead of .*? so that backtracking won't waste CPU time. Please include \s after ($DEST_PORT) so that you won't match just a prefix of the destination port number. I'd add to the above what perlguy recommended:

    my %r= (
        source_addr => '\S+',
        dest_addr   => '\S+',
        proto       => '\S+',
        source_port => '\d+',
        dest_port   => '\d+',
    );
    for my $param (  keys %r  ) {
        my $value= $cgi->param($param) || "";
        if(  $value =~ s/^!\s*//  ) {
            # \\s, because a bare \s is not an escape in a double-quoted string
            $r{$param}= "(?!\Q$value\E\\s)($r{$param})";
        } elsif(  $value =~ /\S/  ) {
            $r{$param}= "(\Q$value\E)";
        } else {
            $r{$param}= "($r{$param})";
        }
    }
    my $regex= qr/
        ^
        \d+\s+
        (\d+)\s+
        $r{source_addr}\s+
        ->\s+
        $r{dest_addr}\s+
        $r{proto}\s+
        $r{source_port}\s+
        >\s+
        $r{dest_port}\s+
    /x;
    while(  my $pkt= <DATA>  ) {
        print $pkt   if  $pkt =~ $regex;
    }

- tye
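The approach above can be tried without a CGI environment. The following is a sketch under stated assumptions, not the code from the reply: the sub names `build_filter` and `matching_ids` are made up, the input arrives as a plain hash instead of CGI parameters, and every default is \S+ so that the `?` ports on the ICMP lines still match. It is run against the nine sample lines from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @FIELDS = qw(source_addr dest_addr proto source_port dest_port);

# Build one anchored regex from a hash of field => value pairs.
# A leading "!" on a value means "anything except this value".
sub build_filter {
    my (%in) = @_;
    my %r;
    for my $f (@FIELDS) {
        my $value = defined $in{$f} ? $in{$f} : '';
        $value =~ s/^\s+|\s+$//g;                 # trim stray whitespace
        if ($value =~ s/^!\s*//) {
            $r{$f} = '(?!' . quotemeta($value) . '\s)(\S+)';
        } elsif (length $value) {
            $r{$f} = '(' . quotemeta($value) . ')';
        } else {
            $r{$f} = '(\S+)';
        }
    }
    return qr/
        ^ \d+ \s+ (\d+) \s+
        $r{source_addr} \s+ -> \s+ $r{dest_addr} \s+
        $r{proto} \s+
        $r{source_port} \s+ > \s+ $r{dest_port} \s
    /x;
}

# Return the leading line numbers of the matching lines, comma-joined.
sub matching_ids {
    my ($re, @lines) = @_;
    return join ',', map { /^(\d+)/ ? $1 : () } grep { $_ =~ $re } @lines;
}

my @lines = (
    "1 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 80 [SYN]\n",
    "2 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 113 [SYN]\n",
    "3 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 123 [SYN]\n",
    "4 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply\n",
    "5 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) request\n",
    "6 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply\n",
    "7 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 562 [RST]\n",
    "8 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 36 [RST]\n",
    "9 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 90 [RST]\n",
);

print "proto=!TCP   : ", matching_ids(build_filter(proto => '!TCP'), @lines), "\n";
print "dest_port=!80: ", matching_ids(build_filter(dest_port => '!80'), @lines), "\n";
```

The pieces are assembled with quotemeta and single-quoted strings, which sidesteps the question of how \s behaves inside a double-quoted string.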
Re: Runtime Regexp Generation by BrowserUk (Patriarch) on Apr 14, 2003 at 16:05 UTC
I'm not known for being one to say "don't use regexes for that" or "Use a database", but in this instance, what you are trying to devise is a Query Language. You may be able to achieve your aims with regexes, but moving your data into a DB and using SQL will undoubtedly save you considerable time and effort.

Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.
I agree, but... by tekkie (Beadle) on Apr 14, 2003 at 16:20 UTC
the data is generated on the fly, it's a packet capture of traffic on a network segment at any given moment. The data isn't already there... I'm collecting, crunching, and producing output from the data all in one go.
Re: I agree, but... by BrowserUk (Patriarch) on Apr 14, 2003 at 17:40 UTC
The major pain with trying to select records using regexes is that you have to try and match the whole record instead of just the fields that you are selecting on, hence your difficulties with specifying the logical select "anything except this". The second problem is that of having your regex match against data in another part of the record than the field that you are interested in.

By imposing some structure on your data--ie. making the fields in the record fixed length--and matching or rejecting on a field-by-field basis rather than trying to match (or not) a whole record at a time, you greatly simplify the process. This is what you would get by moving your data into a flat file DB and using DBI to perform your queries.

At the very least, you should consider fixing the length of the fields of your records. You could then use substr in conjunction with a regex to greatly simplify the process of your queries. Eg.

    if (    substr($record,  0, 10) =~ $src_ip_of_interest
        and substr($record, 10, 10) =~ $dst_ip_of_interest
        and substr($record, 20,  4) =~ $proto_of_interest
        and substr($record, 24,  6) !~ $src_port_of_disinterest
        # etc ...
    ) {
        # we found a record that matches the query
    }

I think that you can see how much this simplifies the regexes involved. Generating conditionals using this form and using eval to execute them would be much simpler than trying to come up with a generic regex generator.

That said, using BerkeleyDB or similar in conjunction with DBI::* would be considerably easier to code and probably much quicker in performance.
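A minimal sketch of the fixed-length idea follows. The field widths (15, 15, 5, 6, 6), the sub names and the particular query are all assumptions chosen for the illustration; they differ from the offsets in the snippet above, which would be too narrow for a full dotted-quad address. pack with the `A` template pads each field with spaces, so every field sits at a known offset.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed layout: source addr, dest addr, protocol, source port, dest port.
my $LAYOUT = 'A15 A15 A5 A6 A6';

# Turn a list of five values into one space-padded, fixed-width record.
sub to_record { return pack($LAYOUT, @_) }

# Extract one field by offset and length, dropping the padding.
sub field {
    my ($rec, $offset, $len) = @_;
    my $v = substr($rec, $offset, $len);
    $v =~ s/\s+$//;
    return $v;
}

# Example query: TCP from 1.2.3.4 to any destination port except 80.
sub wanted {
    my ($rec) = @_;
    return field($rec,  0, 15) eq '1.2.3.4'
        && field($rec, 30,  5) eq 'TCP'
        && field($rec, 41,  6) ne '80';
}

my @records = (
    to_record('1.2.3.4', '4.3.2.1', 'TCP',  '3456', '80'),
    to_record('1.2.3.4', '4.3.2.1', 'TCP',  '3456', '113'),
    to_record('2.3.4.5', '5.4.3.2', 'ICMP', '?',    '?'),
);

for my $rec (@records) {
    print join(' ', unpack($LAYOUT, $rec)), "\n" if wanted($rec);
}
```

Because each test looks at exactly one field, "anything except this" becomes a plain `ne` (or `!~`) and cannot accidentally match text elsewhere in the record.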
Re^4: Runtime Regexp Generation (I agree, but...) by tye (Sage) on Apr 14, 2003 at 18:09 UTC
Re: Re^4: Runtime Regexp Generation (I agree, but...) by dmitri (Priest) on Apr 14, 2003 at 22:38 UTC
Re: Runtime Regexp Generation by l2kashe (Deacon) on Apr 14, 2003 at 18:35 UTC
Personal preference is when regexen get this large to either A) build them in steps, or B) use something else.. here is a basic filter using basic logic and tests...

    #!/usr/bin/perl

    push(@foo,
        '1 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 80 [SYN]',
        '2 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 113 [SYN]',
        '3 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 123 [SYN]',
        '4 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply',
        '5 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) request',
        '6 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply',
        '7 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 562 [RST]',
        '8 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 36 [RST]',
        '9 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 90 [RST]',
    );

    # assume that when split the fields are as follows..
    #line  = '0';
    #size  = '1';
    #src   = '2';
    #dest  = '4';
    #proto = '5';
    #port  = '8';

    # here is what we will test on.. this could be altered to be
    # collected via flags, shifted off of ARGV, or passed as
    # params to a CGI easily...
    print "proto: ";
    chomp(my $i_proto=<>);

    print "port: ";
    chomp(my $i_port=<>);

    # loop over our data set, this could just as easily be a
    # socket or filehandle..
    for ( @foo ) {
        my @line = split(/\s+/);

        if ($i_proto) {
            (my $tmp = $i_proto) =~ s/^!//;
            if ($i_proto =~ /^!/) {
                next if ($line[5] =~ /$tmp/);
            } else {
                next if ($line[5] !~ /$tmp/);
            }
        }

        if ($i_port) {
            (my $tmp = $i_port) =~ s/^!//;
            if ($i_port =~ /^!/) {
                next if ($line[8] =~ /$tmp/);
            } else {
                next if ($line[8] !~ /$tmp/);
            }
        }

        print "$_\n";
    }

I usually place a sample data line or 2 in my source file, so that people who come along after me know what elements I'm working on, or they can compare the data being passed to the code, vs the data the code is assuming it is receiving and go "duh.. we upgraded app X, need to alter the filter.."

I know the question was how to get a regex to match, but personally in this situation, I think it might be better to move away from the regex, as it makes the code clearer and easier to maintain..

almost update: I guess you could also alter the split to only return the items you will ever search on, but I tend to attempt to not dictate what possible uses the code may have in the future.. A slightly better split might be something like

    # @data now contains src_addr, dest_addr, proto, and port
    @data = ( split(/\s+/) )[2,4,5,8];
    # later test elem 2 instead of 5 (proto) and 3 instead of 8 (port).. yada yada

MMMMM... Chocolaty Perl Goodness.....
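The split-based filter also offers one way around the whitespace problem raised in the question for a future TCP_TYPE field. The sketch below is an assumption-laden illustration rather than anyone's posted code: the field name `tcp_type`, the `%INDEX` table, the sub `line_matches` and the `[SYN, ACK]` sample line are hypothetical. It passes a limit to split so that everything after the destination port stays together in one final field, and it compares with `eq` so a value only matches the whole field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Position of each searchable field after splitting on whitespace.
# tcp_type is a hypothetical name for the trailing flags/description field.
my %INDEX = (
    source_addr => 2,
    dest_addr   => 4,
    proto       => 5,
    source_port => 6,
    dest_port   => 8,
    tcp_type    => 9,
);

# Return 1 if the line satisfies every field => value test, else 0.
# A leading "!" on a value negates the test.
sub line_matches {
    my ($line, %want) = @_;
    chomp $line;
    # Limit of 10: the tenth field keeps its internal whitespace intact.
    my @f = split ' ', $line, 10;
    for my $name (keys %want) {
        my $idx = $INDEX{$name};
        die "unknown field '$name'" unless defined $idx;
        my $actual = defined $f[$idx] ? $f[$idx] : '';
        my $expect = $want{$name};
        my $negate = ($expect =~ s/^!\s*//);
        my $equal  = ($actual eq $expect);
        return 0 if $negate ? $equal : !$equal;
    }
    return 1;
}

my @lines = (
    '1 60 1.2.3.4 -> 4.3.2.1 TCP 3456 > 80 [SYN]',
    '2 60 4.3.2.1 -> 1.2.3.4 TCP 80 > 3456 [SYN, ACK]',   # invented sample
    '4 120 2.3.4.5 -> 5.4.3.2 ICMP ? > ? echo (ping) reply',
);

for my $line (@lines) {
    print "$line\n" if line_matches($line, proto => 'TCP', tcp_type => '![SYN]');
}
```

Exact comparison also avoids the prefix issue of an unanchored pattern, where a test for port 80 would match 8080 as well.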
Re: Runtime Regexp Generation (ngrep) by Aristotle (Chancellor) on Apr 14, 2003 at 21:21 UTC
Sounds like you're trying to reinvent ngrep to me. Makeshifts last the longest.
Re: Runtime Regexp Generation by crenz (Priest) on Apr 15, 2003 at 15:10 UTC
Personally, I'd refrain from using one regex for all your querying -- I find it too hard to read. Why not just use `split()` and then use regexes on the individual fields? A sample loop for processing line by line:

    $user_param{source_addr} ||= '.';   # default value for param
    $user_param{dest_addr}   ||= '.';

    while (<DATA>) {
        my @data = split(/\s/, $_);
        next unless $data[2] =~ $user_param{source_addr};
        next unless $data[4] =~ $user_param{dest_addr};   # $data[3] is the "->"
        # ...
        print $_;
    }