Not only does using /x make things a lot more readable, it also helps with debugging. By commenting out everything except the first element in the final regex, I could adjust that element until it worked for all (both :) test lines. Then I uncommented the next element, adjusted that, and so on until the whole thing matched.
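A minimal sketch of that incremental approach, assuming two shortened, made-up sample lines rather than the real log data. The names @lines, $re_partial and @matched are only for illustration. Once both lines match, the next element gets uncommented.

```perl
use strict;
use warnings;

# Made-up, shortened sample lines in the same shape as the test data.
my @lines = (
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410381: rest of line',
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410382: more text',
);

# Only the first two elements are live; the third waits its turn.
my $re_partial = qr{
    ^
    ( [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ) \s+
    ( \[ \d (?: \. \d+ )+ \] )                           \s+
#   ( \d{6} : )                                          \s+
}x;

my @matched = grep { $_ =~ $re_partial } @lines;
printf "%d of %d lines match so far\n", scalar @matched, scalar @lines;
```

One thing to watch: Perl finds the closing delimiter before it looks at /x comments, so a commented-out element must not contain an unbalanced delimiter.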

Using named sub-elements allows you to re-use those bits where necessary, and would simplify adding in predefined elements like a better IP definition from Regexp::Common or a datetime from somewhere.
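As a rough illustration of swapping in a better element, here is a hand-rolled stricter IP definition. $re_octet and $re_pair are hypothetical names of mine, not part of the program below. Something like $RE{net}{IPv4} from Regexp::Common could presumably be dropped into $re_ip the same way, though I have kept this sketch free of non-core modules.

```perl
use strict;
use warnings;

# Hypothetical stricter octet: accepts 0-255 only.
my $re_octet = qr/ (?: 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d ) /x;

# Same variable name as the loose version, so nothing else has to change.
my $re_ip = qr/ $re_octet (?: \. $re_octet ){3} /x;

# A cut-down stand-in for the full log regex, re-using $re_ip twice.
my $re_pair = qr/ ^ ( $re_ip ) \s+ -> \s+ ( $re_ip ) $ /x;

for my $text ( '10.161.24.153 -> 10.158.24.10', '10.161.24.999 -> 10.158.24.10' ) {
    print "$text: ", ( $text =~ $re_pair ? 'ok' : 'rejected' ), "\n";
}
```

Because the full regex only ever refers to $re_ip, tightening the definition in one place affects both the source and the destination address.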

#! perl -slw
use strict;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;             # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                            # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                         # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                   # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;                      # list 101
my $re_action   = qr[ [a-z]+ ]x;                             # denied
my $re_protocol = qr[ [a-z]+ ]x;                             # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;                # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;          # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;           # , 1 packet

my $re_log = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB )      \s+
    ( $re_msgid )    \s+
    ( $re_datetime)  \s+
    ( $re_TZ )       \s+
      $re_type       \s+
      $re_listid     \s+
    ( $re_action )   \s+
    ( $re_protocol ) \s+
    ( $re_ip )       \s*
      $re_port?      \s+
      ->             \s+
    ( $re_ip )       \s*
      $re_port?
      $re_packets    \s*
      $
]x;

while( <DATA> ) {
    print join'|', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13
        if $_ =~ m[$re_log];
}

=pod output
P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Use of uninitialized value in join or string at P:\test\285616.pl8 line 37, <DATA> line 2.
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150||211.95.79.233|8/0|1
=cut


__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

Note that the second line produces an "uninitialised value" warning. This is because that line has no port number after the first IP number, so the optional capture ($10) is left undefined. Having to check every numbered capture that might not have participated, and to renumber them all whenever an element is added or removed, is a pain.
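For comparison, this is roughly what the conditional handling looks like with numbered captures. It is a cut-down sketch: the pattern, the two sample strings and the @rows array are made up for illustration and cover only the first IP and its optional port.

```perl
use strict;
use warnings;

# First IP plus an optional "(port)"; $2 is undef when the port is absent.
my $re = qr/ ^ ( \d+ (?: \. \d+ ){3} ) \s* (?: \( ( \d+ ) \) )? \s+ -> /x;

my @rows;
for my $line ( '10.161.24.153(3988) -> x', '10.165.4.150 -> y' ) {
    next unless $line =~ $re;
    # Replace any capture that did not participate with an empty string.
    my @fields = map { defined $_ ? $_ : '' } ( $1, $2 );
    push @rows, join '|', @fields;
}
print "$_\n" for @rows;
```

That silences the warning, but every field is still known only by its position in the regex.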

The best way I know of to avoid all the conditionals and stuff required to deal with regexes that contain conditional captures is to capture to named variables using the (?{ }) extended regex feature.
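A minimal sketch of the mechanism on a cut-down pattern, with made-up sample strings. $^N holds the text of the most recently closed group, so a code block placed after an optional group that did not match picks up the previous group's text instead. The second output line of the full listing shows exactly this, with the first port field repeating the IP address. Putting the code block inside the optional group, and resetting the variables at the start of the match, is one way around that. The names $ip, $port and @seen are just for this example.

```perl
use strict;
use warnings;

# Package variables, declared before the regex so the code blocks can see them.
our ( $ip, $port );

my $re = qr/
    ^
    (?{ $ip = $port = '' })
    ( \d+ (?: \. \d+ ){3} )   (?{ $ip   = $^N })
    \s*
    (?: \( ( \d+ ) \)         (?{ $port = $^N }) )?
    \s+ ->
/x;

my @seen;
for my $line ( '10.161.24.153(3988) -> a', '10.165.4.150 -> b' ) {
    push @seen, "$ip|$port" if $line =~ $re;
}
print "$_\n" for @seen;
```

This pattern is written out literally, so it needs no use re 'eval'. Depending on your Perl version, that pragma may be required once other variables are interpolated into a pattern containing code blocks.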

#! perl -slw
use strict;
use re 'eval';

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;             # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                            # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                         # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                   # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;                      # list 101
my $re_action   = qr[ [a-z]+ ]x;                             # denied
my $re_protocol = qr[ [a-z]+ ]x;                             # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;                # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;          # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;           # , 1 packet

# Declared ahead of the regex so the (?{ }) blocks compile under strict.
our( $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
     $action, $protocol, $ip1, $port, $ip2, $port2, $packets );

my $re_log = qr[
    ^
    ( $re_datetime ) \s+  (?{ $first_date  = $^N||'' })
    ( $re_MIB )      \s+  (?{ $MIB         = $^N||'' })
    ( $re_msgid )    \s+  (?{ $msgID       = $^N||'' })
    ( $re_datetime)  \s+  (?{ $second_date = $^N||'' })
    ( $re_TZ )       \s+  (?{ $TZ          = $^N||'' })
      $re_type       \s+
      $re_listid     \s+  (?{ $listID      = $^N||'' })
    ( $re_action )   \s+  (?{ $action      = $^N||'' })
    ( $re_protocol ) \s+  (?{ $protocol    = $^N||'' })
    ( $re_ip )       \s*  (?{ $ip1         = $^N||'' })
      $re_port?      \s+  (?{ $port        = $^N||'' })
      ->             \s+
    ( $re_ip )       \s*  (?{ $ip2         = $^N||'' })
      $re_port?           (?{ $port2       = $^N||'' })
      $re_packets    \s*  (?{ $packets     = $^N||'' })
      $
]x;

while( <DATA> ) {
    print join'|', $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
                   $action, $protocol, $ip1, $port, $ip2, $port2, $packets
        if $_ =~ m[$re_log];
}

=pod output
P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150|10.165.4.150|211.95.79.233|8/0|1

=cut


__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

I like this because it avoids the capture variable shuffling, and if you start using this approach consistently, it becomes pretty much second nature to build regexes this way. The downsides are the "experimental" status of the "zero-width evaluation assertion" (Phew! What a handle :) and the need to use re 'eval', both of which are frowned upon in some circles.
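If those downsides matter, named captures are a different technique that gets a similar effect without code blocks. They arrived in Perl 5.10, so this sketch assumes at least that version. It is cut down to the two IP/port pairs, the group names ip1, port1, ip2 and port2 are my own, and the sample strings are shortened from the test data. Groups that did not participate are simply absent from %+.

```perl
use strict;
use warnings;
use 5.010;    # named captures, the // operator and say

my $re_ip = qr/ \d+ (?: \. \d+ ){3} /x;

# Each capture is stored in %+ under its name rather than by position.
my $re = qr{
    ^
    (?<ip1> $re_ip ) \s* (?: \( (?<port1> \d+ (?: / \d+ )? ) \) )? \s+
    ->               \s+
    (?<ip2> $re_ip ) \s* (?: \( (?<port2> \d+ (?: / \d+ )? ) \) )?
}x;

my @rows;
for my $line (
    '10.161.24.153(3988) -> 10.158.24.10(135)',
    '10.165.4.150 -> 211.95.79.233 (8/0)',
) {
    next unless $line =~ $re;
    # A missing optional group has no entry in %+, so default it to ''.
    push @rows, join '|', map { $+{$_} // '' } qw( ip1 port1 ip2 port2 );
}
say for @rows;
```

Adding or removing an element no longer disturbs the other fields, and there is no $^N to go stale when an optional group is skipped.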

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

In reply to Re: Cisco Log Files: broken REGEX (two solutions) by BrowserUk
in thread Cisco Log Files: broken REGEX by blue_cowdawg
