in reply to Cisco Log Files: broken REGEX

Not only does using /x make things a lot more readable, it also helps with debugging. By commenting out everything except the first element in the final regex, I could adjust that element until it worked for all (both:) test lines. Then I uncommented the next element and adjusted that, and so on until the whole thing matched.
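A minimal sketch of that incremental approach, using the first few elements of the log pattern. The helper name matches_so_far and the shortened sample line are only for illustration; the elements parked behind '#' are the ones not yet debugged.

```perl
use strict;
use warnings;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;
my $re_msgid    = qr[ \d{6} : ]x;

# Only the first two elements are live. Under /x the remaining ones can be
# commented out and re-enabled one at a time as each is shown to match.
my $re_partial = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB      ) \s+
#   ( $re_msgid    ) \s+
#   ( $re_datetime ) \s+
]x;

# Hypothetical helper: the pieces captured so far, or an empty list.
sub matches_so_far {
    my( $line ) = @_;
    return $line =~ $re_partial ? ( $1, $2 ) : ();
}

# prints: Aug 21 19:00:36|[1.1.1.3.200.125]
print join( '|', matches_so_far(
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC:'
) ), "\n";
```

Once both test lines get past the live elements, the next '#' comes off and the process repeats.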

Using named sub-elements allows you to re-use those bits where necessary, and would simplify adding in predefined elements like a better IP definition from Regexp::Common or a datetime from somewhere.
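As a sketch of that re-use, the loose IP element can be swapped for a stricter one without touching anything built on top of it. The octet pattern here is hand-rolled purely for illustration (it is an assumption, not the pattern Regexp::Common supplies), and $re_endpoint and endpoints are hypothetical names.

```perl
use strict;
use warnings;

# Stricter octet, 0-255 only. Hand-written for illustration.
my $re_octet = qr[ (?: 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d ) ]x;
my $re_ip    = qr[ $re_octet (?: \. $re_octet ){3} ]x;
my $re_port  = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# One named element, built from the smaller ones and used twice below.
my $re_endpoint = qr[ ( $re_ip ) \s* $re_port? ]x;

# Hypothetical helper: ( ip, port, ip, port ) or an empty list on no match.
# A port that is absent comes back as undef.
sub endpoints {
    my( $text ) = @_;
    return $text =~ m[ $re_endpoint \s+ -> \s+ $re_endpoint ]x
        ? ( $1, $2, $3, $4 )
        : ();
}

# prints: 10.161.24.153|3988|10.158.24.10|135
print join( '|', endpoints( '10.161.24.153(3988) -> 10.158.24.10(135)' ) ), "\n";
```

Only the definition of $re_ip changed; everything that interpolates it picks up the stricter version for free.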

    #! perl -slw
    use strict;

    my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x; # Aug 21 19:00:36
    my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;         # [1.1.1.3.200.125]
    my $re_msgid    = qr[ \d{6} : ]x;                        # 410381:
    my $re_TZ       = qr[ [A-Z]{3} : ]x;                     # UTC:
    my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;               # %SEC-6-IPACCESSLOGP:
    my $re_listid   = qr[ list \s (\d+) ]x;                  # list 101
    my $re_action   = qr[ [a-z]+ ]x;                         # denied
    my $re_protocol = qr[ [a-z]+ ]x;                         # tcp
    my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;            # 10.161.24.153
    my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;      # (3988) or (8/0)
    my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;       # , 1 packet

    my $re_log = qr[
        ^
        ( $re_datetime ) \s+
        ( $re_MIB      ) \s+
        ( $re_msgid    ) \s+
        ( $re_datetime ) \s+
        ( $re_TZ       ) \s+
        $re_type         \s+
        $re_listid       \s+
        ( $re_action   ) \s+
        ( $re_protocol ) \s+
        ( $re_ip       ) \s* $re_port? \s+
        ->               \s+
        ( $re_ip       ) \s* $re_port?
        $re_packets      \s*
        $
    ]x;

    while( <DATA> ) {
        print join'|', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13
            if $_ =~ m[$re_log];
    }

    =pod output

    P:\test>285616
    Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
    Use of uninitialized value in join or string at P:\test\285616.pl8 line 37, <DATA> line 2.
    Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150||211.95.79.233|8/0|1

    =cut

    __DATA__
    Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
    Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

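Note that the second test line produces an "uninitialised value" warning. That line has no port number after the first IP address, so the optional capture for it ($10) takes no part in the match and is left undefined. Having to remember which numbered captures may be undefined, and to renumber all the later ones whenever an optional element is added or removed, is a pain.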
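One way to quiet the warning without touching the regex is to default every capture before joining. A minimal sketch on a shortened pattern covering just the two endpoints; fields is a hypothetical helper name.

```perl
use strict;
use warnings;

my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# Returns the four captures with undef replaced by '',
# or an empty list if the text does not match.
sub fields {
    my( $text ) = @_;
    my @caps = $text =~ m[
        ( $re_ip ) \s* $re_port? \s+
        -> \s+
        ( $re_ip ) \s* $re_port?
    ]x or return;
    return map { defined $_ ? $_ : '' } @caps;
}

# prints: 10.165.4.150||211.95.79.233|8/0
print join( '|', fields( '10.165.4.150 -> 211.95.79.233 (8/0)' ) ), "\n";
```

This keeps the output tidy, but the captures are still positional, so the bookkeeping problem remains.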

The best way I know of to avoid all the conditionals and stuff required to deal with regexes that contain optional captures is to capture to named variables using the (?{ }) extended regex feature, assigning $^N (the text of the most recently closed group) as each element matches.

    #! perl -slw
    use strict;
    use re 'eval';

    # Package variables filled in by the (?{ }) blocks. They are declared
    # before the regex so the code blocks also compile under strict on
    # newer perls, which compile them along with the surrounding code.
    our(
        $first_date, $MIB, $msgID, $second_date, $TZ, $listID, $action,
        $protocol, $ip1, $port, $ip2, $port2, $packets
    );

    my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x; # Aug 21 19:00:36
    my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;         # [1.1.1.3.200.125]
    my $re_msgid    = qr[ \d{6} : ]x;                        # 410381:
    my $re_TZ       = qr[ [A-Z]{3} : ]x;                     # UTC:
    my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;               # %SEC-6-IPACCESSLOGP:
    my $re_listid   = qr[ list \s (\d+) ]x;                  # list 101
    my $re_action   = qr[ [a-z]+ ]x;                         # denied
    my $re_protocol = qr[ [a-z]+ ]x;                         # tcp
    my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;            # 10.161.24.153
    my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;      # (3988) or (8/0)
    my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;       # , 1 packet

    # Each code block copies the most recently closed capture group into a
    # named variable. Note: when an optional port is absent, the most recently
    # closed group is still the IP before it, which is why the second output
    # line below repeats the IP in the first port field.
    my $re_log = qr[
        ^
        ( $re_datetime ) \s+    (?{ $first_date  = $^N||'' })
        ( $re_MIB      ) \s+    (?{ $MIB         = $^N||'' })
        ( $re_msgid    ) \s+    (?{ $msgID       = $^N||'' })
        ( $re_datetime ) \s+    (?{ $second_date = $^N||'' })
        ( $re_TZ       ) \s+    (?{ $TZ          = $^N||'' })
        $re_type         \s+
        $re_listid       \s+    (?{ $listID      = $^N||'' })
        ( $re_action   ) \s+    (?{ $action      = $^N||'' })
        ( $re_protocol ) \s+    (?{ $protocol    = $^N||'' })
        ( $re_ip       ) \s*    (?{ $ip1         = $^N||'' })
        $re_port?        \s+    (?{ $port        = $^N||'' })
        ->               \s+
        ( $re_ip       ) \s*    (?{ $ip2         = $^N||'' })
        $re_port?               (?{ $port2       = $^N||'' })
        $re_packets      \s*    (?{ $packets     = $^N||'' })
        $
    ]x;

    while( <DATA> ) {
        print join'|',
            $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
            $action, $protocol, $ip1, $port, $ip2, $port2, $packets
                if $_ =~ m[$re_log];
    }

    =pod output

    P:\test>285616
    Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
    Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150|10.165.4.150|211.95.79.233|8/0|1

    =cut

    __DATA__
    Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
    Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet
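In the second output line above, the field after the first IP repeats the IP instead of being empty: when the optional port matches nothing, $^N still refers to the most recently closed group, which is the IP. A sketch of one way around that, on a shortened pattern, is to keep each port assignment inside its optional group and clear the variables before every match. The names $re_flow, $port1 and parse_flow are hypothetical.

```perl
use strict;
use warnings;
use re 'eval';    # kept for older perls; harmless on newer ones

# Package variables so the code blocks can see them.
our( $ip1, $port1, $ip2, $port2 );

my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# The port assignments only run when the port element actually matched.
my $re_flow = qr[
    ( $re_ip ) \s*      (?{ $ip1   = $^N })
    (?: $re_port        (?{ $port1 = $^N }) )?
    \s+ -> \s+
    ( $re_ip ) \s*      (?{ $ip2   = $^N })
    (?: $re_port        (?{ $port2 = $^N }) )?
]x;

# Hypothetical helper: ( ip, port, ip, port ), with '' for a missing port.
sub parse_flow {
    my( $text ) = @_;
    ( $ip1, $port1, $ip2, $port2 ) = ( '' ) x 4;    # clear leftovers
    return $text =~ $re_flow ? ( $ip1, $port1, $ip2, $port2 ) : ();
}

# prints: 10.165.4.150||211.95.79.233|8/0
print join( '|', parse_flow( '10.165.4.150 -> 211.95.79.233 (8/0)' ) ), "\n";
```

Plain assignments like these are not undone when the regex engine backtracks; perlre describes using local inside the code blocks when that matters.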

I like this because it avoids the capture variable shuffling, and if you start using this approach consistently, it becomes pretty much second nature to build regexes this way. The downsides are the "experimental" status of the "zero-width evaluation assertion" (Phew! What a handle:) and the need for use re 'eval';, both of which are frowned upon in some circles.
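For anyone who would rather avoid both downsides and can rely on Perl 5.10 or later, named captures read back through %+ give a similar result. This is a different technique from the (?{ }) approach above, sketched on a shortened pattern; $re_portnum, $re_flow and parse_named are hypothetical names.

```perl
use strict;
use warnings;
use 5.010;    # named captures and the // operator

my $re_ip      = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_portnum = qr[ \d+ (?: / \d+ )? ]x;

# Each capture carries its own name, so optional elements cannot disturb
# how the others are referred to.
my $re_flow = qr[
    (?<ip1> $re_ip ) \s* (?: \( (?<port1> $re_portnum ) \) )?
    \s+ -> \s+
    (?<ip2> $re_ip ) \s* (?: \( (?<port2> $re_portnum ) \) )?
]x;

# Hypothetical helper: a name => value list, with '' for anything that
# took no part in the match, or an empty list on no match.
sub parse_named {
    my( $text ) = @_;
    return unless $text =~ $re_flow;
    return map { $_ => $+{$_} // '' } qw( ip1 port1 ip2 port2 );
}

my %f = parse_named( '10.165.4.150 -> 211.95.79.233 (8/0)' );

# prints: 10.165.4.150||211.95.79.233|8/0
print join( '|', @f{ qw( ip1 port1 ip2 port2 ) } ), "\n";
```

The trade-off is that each name can appear only once in this form, so an element such as the IP has to be wrapped at the point of use rather than carrying its capture with it.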


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.