Not only does using /x make things a lot more readable, it also helps with debugging. By commenting out everything except the first element in the final regex, I was able to adjust that element until it worked for all (both:) test lines. Then I uncommented the next element and adjusted that, and so on until the whole thing matched.
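To make that concrete, here is a minimal sketch of one intermediate stage, with only the first two elements live. The name $re_partial and the @lines array are mine, purely for illustration. One thing to watch: variables inside /x comments are still interpolated, so anything mentioned in a commented-out line has to be declared already.

```perl
use strict;
use warnings;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;
my $re_msgid    = qr[ \d{6} : ]x;

# Hypothetical intermediate stage: the first two elements are live, the
# rest are commented out and get re-enabled one at a time.
# $re_msgid is declared above because variables in /x comments still
# interpolate.
my $re_partial = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB      ) \s+
    # ( $re_msgid    ) \s+
    # ( $re_datetime ) \s+
]x;

my @lines = (
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet',
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet',
);

# Count how many test lines the live part of the regex accepts.
my @matched = grep { $_ =~ $re_partial } @lines;
print scalar(@matched), ' of ', scalar(@lines), " lines match so far\n";
```

Once both lines match, uncomment the next element and repeat.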

Using named sub-elements allows you to re-use those bits where necessary, and would simplify adding in predefined elements like a better IP definition from Regexp::Common or a datetime from somewhere.
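As a rough sketch of that re-use: the stricter $re_octet below is my own hypothetical replacement (it is not from Regexp::Common), and $re_flow is just a cut-down piece of the full log regex. Swapping in a better IP definition only touches one line, and both ends of the flow pick it up.

```perl
use strict;
use warnings;

# Hypothetical stricter octet (0-255), defined once.
my $re_octet = qr[ (?: 25[0-5] | 2[0-4]\d | 1\d\d | [1-9]?\d ) ]x;
my $re_ip    = qr[ $re_octet (?: \. $re_octet ){3} ]x;
my $re_port  = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# The same $re_ip and $re_port are reused for source and destination.
my $re_flow = qr[ ( $re_ip ) \s* $re_port? \s+ -> \s+ ( $re_ip ) \s* $re_port? ]x;

my @fields = '10.161.24.153(3988) -> 10.158.24.10(135)' =~ $re_flow;
print join( '|', @fields ), "\n";   # 10.161.24.153|3988|10.158.24.10|135
```

A module-supplied pattern could presumably be dropped into $re_ip the same way, since the rest of the regex only ever refers to it by name.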

#! perl -slw
use strict;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;     # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                    # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                 # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;           # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;              # list 101
my $re_action   = qr[ [a-z]+ ]x;                     # denied
my $re_protocol = qr[ [a-z]+ ]x;                     # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;        # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;  # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;   # , 1 packet

my $re_log = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB ) \s+
    ( $re_msgid ) \s+
    ( $re_datetime) \s+
    ( $re_TZ ) \s+
    $re_type \s+
    $re_listid \s+
    ( $re_action ) \s+
    ( $re_protocol ) \s+
    ( $re_ip ) \s* $re_port? \s+
    -> \s+
    ( $re_ip ) \s* $re_port?
    $re_packets \s*
    $
]x;

while( <DATA> ) {
    print join'|', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13
        if $_ =~ m[$re_log];
}

=pod output

P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Use of uninitialized value in join or string at P:\test\285616.pl8 line 37, <DATA> line 2.
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150||211.95.79.233|8/0|1

=cut

__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

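Note that the second test line produces an "uninitialised value" warning. This is because that line has no port number after the first IP number, so the optional port capture ($10) is left undefined. Perl does not renumber the later groups (the empty field stays in position in the output), but every optional capture then has to be checked before use, and keeping track of which number is which across thirteen captures is a pain.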
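A small sketch of that behaviour, using a cut-down $re_flow of my own rather than the full log regex: in list context the match returns one slot per capture group, an optional group that took no part comes back as undef in its place, and a map can default those slots to an empty string before joining.

```perl
use strict;
use warnings;

my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;
my $re_flow = qr[ ( $re_ip ) \s* $re_port? \s+ -> \s+ ( $re_ip ) \s* $re_port? ]x;

# One slot per capture group; the missing first port is undef, in position.
my @raw = '10.165.4.150 -> 211.95.79.233 (8/0)' =~ $re_flow;

# Default the undefined slots so join does not warn.
my @fields = map { defined $_ ? $_ : '' } @raw;
print join( '|', @fields ), "\n";   # 10.165.4.150||211.95.79.233|8/0
```

This silences the warning, but you still have to count parentheses to know which field is which.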

The best way I know of to avoid all the conditionals and other scaffolding required to deal with regexes that contain optional captures is to capture to named variables using the (?{ }) extended regex feature.

#! perl -slw
use strict;
use re 'eval';

# Package variables filled in by the (?{ }) blocks below. Declared before
# the regex so the code blocks also compile under strict on newer perls,
# which compile them along with the surrounding code.
our( $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
     $action, $protocol, $ip1, $port, $ip2, $port2, $packets );

# Aug 21 19:00:36
my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;     # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                    # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                 # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;           # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;              # list 101
my $re_action   = qr[ [a-z]+ ]x;                     # denied
my $re_protocol = qr[ [a-z]+ ]x;                     # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;        # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;  # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;   # , 1 packet

my $re_log = qr[
    ^
    ( $re_datetime ) \s+   (?{ $first_date  = $^N||'' })
    ( $re_MIB ) \s+        (?{ $MIB         = $^N||'' })
    ( $re_msgid ) \s+      (?{ $msgID       = $^N||'' })
    ( $re_datetime) \s+    (?{ $second_date = $^N||'' })
    ( $re_TZ ) \s+         (?{ $TZ          = $^N||'' })
    $re_type \s+
    $re_listid \s+         (?{ $listID      = $^N||'' })
    ( $re_action ) \s+     (?{ $action      = $^N||'' })
    ( $re_protocol ) \s+   (?{ $protocol    = $^N||'' })
    ( $re_ip ) \s*         (?{ $ip1         = $^N||'' })
    $re_port? \s+          (?{ $port        = $^N||'' })
    -> \s+
    ( $re_ip ) \s*         (?{ $ip2         = $^N||'' })
    $re_port?              (?{ $port2       = $^N||'' })
    $re_packets \s*        (?{ $packets     = $^N||'' })
    $
]x;

while( <DATA> ) {
    print join'|',
        $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
        $action, $protocol, $ip1, $port, $ip2, $port2, $packets
            if $_ =~ m[$re_log];
}

=pod output

P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150|10.165.4.150|211.95.79.233|8/0|1

=cut

__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet

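I like this because it avoids the capture variable shuffling, and if you start using this approach consistently, it becomes pretty much second nature to build regexes this way. One wrinkle shows in the second output line: the missing first port comes out as a repeat of the first IP, because $^N still holds the most recently closed group (the IP) when the optional port group takes no part in the match. The downsides are the "experimental" status of the "zero-width evaluation assertion" (Phew! What a handle:) and the need to use re 'eval';, both of which are frowned upon in some circles.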
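If those downsides bother you, one alternative (a different technique, not the one used above, and it needs Perl 5.10 or later) is named captures read back through %+. This is only a sketch on a cut-down flow regex; the group names ip1, port1, ip2 and port2 and the helper $re_portnum are my own. An optional named group that takes no part in the match is simply undefined in %+, so a missing port does not pick up the preceding IP.

```perl
use strict;
use warnings;

my $re_ip      = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_portnum = qr[ \d+ (?: / \d+ )? ]x;

# Named groups instead of (?{ }) blocks; the port groups are optional.
my $re_flow = qr[
    (?<ip1> $re_ip ) \s*
    (?: \( (?<port1> $re_portnum ) \) )?
    \s+ -> \s+
    (?<ip2> $re_ip ) \s*
    (?: \( (?<port2> $re_portnum ) \) )?
]x;

my @out;
for my $flow ( '10.161.24.153(3988) -> 10.158.24.10(135)',
               '10.165.4.150 -> 211.95.79.233 (8/0)' ) {
    next unless $flow =~ $re_flow;
    # Groups that did not take part are undef in %+; default them to ''.
    my @fields = map { defined $+{$_} ? $+{$_} : '' } qw( ip1 port1 ip2 port2 );
    push @out, join( '|', @fields );
}
print "$_\n" for @out;
```

The trade-off is that the names live inside the regex rather than in your own variables, so you copy them out of %+ after each match.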


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.


In reply to Re: Cisco Log Files: broken REGEX (two solutions) by BrowserUk
in thread Cisco Log Files: broken REGEX by blue_cowdawg
