ACiD has asked for the wisdom of the Perl Monks concerning the following question:

Monks, et al.

Been browsing the monk site for awhile now...GREAT HELP! Thanks.
I am all new to Perl, so any help is greatly appreciated. I have several syslog files that I am parsing. The others are done because their fields are static, but this one is killing me. I open the file and, in a while loop, parse each line (until EOF). Each line looks like the following (from Snort IDS):
May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527

Here's what I have so far (from trying this and that):
if ($line =~ m/^(\w{3}) \s+ (\d+) \s (\d+\:\d+\:\d+) \s (\S+)\s snort [\[(\d+\:\d+\:\d+)\]]*\:\s+(.+)/ox)

$1 through $4 come out OK, but then $5 scoops up till end of line. (please forgive the poor code)

My problem is that I cannot match the variable-length field between [id:id:id] and {TCP}. This area will vary depending on whether the line is {ICMP}, {TCP/UDP}, or an error, which I will sort out with an if statement once this step is resolved.

Any help is appreciated.

Replies are listed 'Best First'.
(jeffa) Re: Variable Length Parsing
by jeffa (Bishop) on Jun 08, 2003 at 21:00 UTC
    Mostly, you are splitting up your line by whitespace, and split is a good tool for that job. But ... this variable length field you speak of throws a monkey wrench in the gears. Here is a regex that should do the trick:
    use strict;
    use warnings;
    use Data::Dumper;

    my $str = 'May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527';

    # no need to use $1 $2 etc. when you can name them ;)
    my ($mon,$day,$time,$app,$id,$msg,$proto,@ip) = $str =~ /
                            # example match
        (\S+)\s+            # May
        (\S+)\s+            # 16
        (\S+)\s+            # 11:17:12
        ([^:]+):\s+         # system app
        (\S+)\s+            # [id:id:id]
        ([^\{]+)            # Message [Class] [Priority]
        (\{[^\}]+\})\s+     # {TCP}
        (\S+)               # 111.1.1.64:1863
        \s+\-\>\s+          # (skip ->)
        (\S+)               # 222.2.2.155:55527
    /x;

    print "$mon, $day, $time, $app, $id, $msg, $proto, @ip\n";
    The important concept to take away from this is negation: (\S+) matches anything but whitespace, ([^:]+) matches anything up to a colon. I stored the two IP addresses in an array; your mileage may vary. Also, (update here) any complex regex such as this one should use the x modifier so it can be segmented and commented.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      All, thank you kindly. I tried several of your tactics and got it to break off at the {TCP} (still have more work, though). I did not know you could use ^blah in midstream. I thought it meant "beginning of line" (the opposite of $). I assume that since I paren'd it with (), it WAS the beginning of the next undefined $4 (or whatever number)?? (Again, forgive the ignorance with respect to the code and language.)
        Yes, ^ can be used to negate a character class as well as anchor a match - it was not the parens that changed context, it was the square brackets (which denote a character class). Check out the docs, in particular, the sentence below xdigit.
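
        To make the two meanings of ^ concrete, here is a minimal sketch. The shortened sample string and the variable names are only for illustration and are not taken from the real log.

        ```perl
        use strict;
        use warnings;

        my $line = 'system app: [id:id:id] Message';

        # Outside square brackets, ^ anchors the match to the start of the string.
        my ($first_word) = $line =~ /^(\w+)/;

        # As the first character inside square brackets, ^ negates the class:
        # [^:]+ means "one or more characters that are not a colon".
        my ($before_colon) = $line =~ /([^:]+)/;

        print "$first_word\n";      # system
        print "$before_colon\n";    # system app
        ```

        The parentheses only capture; they play no part in deciding which meaning ^ has.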

        jeffa

        
Re: Variable Length Parsing
by tcf22 (Priest) on Jun 08, 2003 at 20:40 UTC
    Well, at the end you have (.+), and by default Perl will match as much as it possibly can (i.e., the rest of the string). I believe what you want to do is match up until {TCP}, so try putting {TCP} after the (.+).
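
    A minimal sketch of that suggestion, using the sample line from the question. The leading \]\s+ is just an assumed, shortened way to get past the [id:id:id] field, and the braces are escaped so they are taken literally.

    ```perl
    use strict;
    use warnings;

    # Sample line from the question; the literal {TCP} marks where the
    # variable-length field ends.
    my $line = 'May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527';

    # Greedy (.+) first runs to the end of the line, then backtracks until
    # the rest of the pattern, \s+\{TCP\}, can match.
    my ($msg) = $line =~ /\]\s+(.+)\s+\{TCP\}/;

    print "$msg\n";    # Message [Class] [Priority:]
    ```

    This only handles {TCP} lines; other protocols need an alternation or a more general pattern.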
Re: Variable Length Parsing
by CombatSquirrel (Hermit) on Jun 08, 2003 at 20:56 UTC
    Not knowing which characters in your example are metacharacters (for example, I can't find the 'snort' in the example; maybe you could give some actual ones?), I suspect that the regex is not quite doing what you think it does. The part [\[(\d+\:\d+\:\d+)\]]* is actually equivalent to [\[(\d+:)\]]*. It is a character class of "]", "[", "(", ")", ":", "+" and the digits, which matches zero or more times (greedy matching). It treats these as loose single characters, not as one bracketed id:id:id field. Because the pattern requires \:\s+ immediately after it, the class cannot reach the bracketed field; whatever it matches, everything after that colon and whitespace is slurped up by (.+).
    Although I don't know the format, I would recommend changing the mentioned part to (?:\[(\d+\:\d+\:\d+)\])*, which is probably more along the lines of what you intended to do.
    Note that [] stands for a character class, whereas (?:) are non-capturing parens.
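
    The difference can be seen in a small sketch. The id value and the junk string below are made up for illustration.

    ```perl
    use strict;
    use warnings;

    my $field = '[1:2003:4]';   # hypothetical id field, standing in for [id:id:id]
    my $junk  = ']]::5((';      # made-up string that is clearly not an id field

    # Square brackets build a character class: this matches any run of single
    # characters drawn from the set [ ] ( ) + : and the digits, in any order.
    my $class = qr/^[\[(\d+\:\d+\:\d+)\]]*$/;

    # A non-capturing group wraps a sequence: a literal [, three
    # colon-separated numbers (captured), and a literal ].
    my $group = qr/^(?:\[(\d+\:\d+\:\d+)\])*$/;

    print "class accepts the junk\n" if     $junk =~ $class;
    print "group rejects the junk\n" unless $junk =~ $group;

    my ($ids) = $field =~ $group;
    print "$ids\n";    # 1:2003:4
    ```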
    Hope that helped.
Re: Variable Length Parsing
by TomDLux (Vicar) on Jun 08, 2003 at 20:40 UTC

    You're trying to break this into words, right? Why not use split()?

    my ( @words ) = split /\s/, $line;
      That will find you an empty string between each of a series of consecutive whitespace characters. You probably want \s+ unless you're really splitting fields separated by exactly one blank each.
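
      A quick sketch of the difference. The sample string, with two spaces after "May", is made up for illustration.

      ```perl
      use strict;
      use warnings;

      my $line = 'May  6 11:17:12 system';   # note the two spaces after "May"

      my @single = split /\s/,  $line;   # every whitespace character is a separator
      my @multi  = split /\s+/, $line;   # a run of whitespace is one separator

      print scalar(@single), "\n";   # 5: 'May', '', '6', '11:17:12', 'system'
      print scalar(@multi),  "\n";   # 4: 'May', '6', '11:17:12', 'system'
      ```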

      Makeshifts last the longest.

Re: Variable Length Parsing
by pzbagel (Chaplain) on Jun 08, 2003 at 20:44 UTC

    Two things you should do. First, switch .+ to its non-greedy version (.+?). Second, if you know the variable-length field will be terminated with TCP/UDP/ICMP/ERROR, then create an alternation that states that in your regex to ensure that $5 doesn't suck up everything till the end of the line:

    /(TCP|UDP|ICMP|IP|ERROR)/
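
    Put together, the two changes might look like the sketch below. The shortened sample line and the \]\s+ prefix are assumptions for illustration, as is the idea that every terminator appears inside braces.

    ```perl
    use strict;
    use warnings;

    my $line = '[id:id:id] Message [Class] [Priority:] {UDP} 111.1.1.64:1863 -> 222.2.2.155:55527';

    # (.+?) takes as little as possible; the alternation names the protocols
    # that may end the field (list taken from above; adjust to your logs).
    my ($msg, $proto) = $line =~ /\]\s+(.+?)\s+\{(TCP|UDP|ICMP|IP|ERROR)\}/;

    print "$msg\n";     # Message [Class] [Priority:]
    print "$proto\n";   # UDP
    ```

    Capturing the protocol as well gives you something to branch on afterwards.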

    HTH

Re: Variable Length Parsing
by tos (Deacon) on Jun 08, 2003 at 20:48 UTC
    Hi,

    probably an alternation like this

    # perl -we '$x="somestring 12345 {ICMP}";$x=~/(.+({TCP}|{ICMP}))/ && print "$1\n"'
    somestring 12345 {ICMP}
    could help you.