Variable Length Parsing

ACiD has asked for the wisdom of the Perl Monks concerning the following question:

Monks, et al.

Been browsing the monk site for a while now...GREAT HELP! Thanks.
I am all new to Perl, so any help is greatly appreciated. I have several syslog files that I am parsing. The others are done because their fields are static; this one, however, is killing me. I open the file and, in a while loop, parse each line (until eof). Each line looks like the following (from Snort IDS):
May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527

Here's what I have so far (from trying this and that):
if ($line =~ m/^(\w{3}) \s+ (\d+) \s (\d+\:\d+\:\d+) \s (\S+)\s snort [\[(\d+\:\d+\:\d+)\]]*\:\s+(.+)/ox)

$1 through $4 come out OK, but then $5 scoops up till end of line. (please forgive the poor code)

My problem is that I cannot match the variable length field between [id:id:id] and {TCP}. This area will vary depending on {ICMP}, {TCP/UDP}, or error, which I will handle with an if statement once this step is resolved.

Any help is appreciated.


Replies are listed 'Best First'.
(jeffa) Re: Variable Length Parsing by jeffa (Bishop) on Jun 08, 2003 at 21:00 UTC
Mostly, you are splitting up your line by whitespace, and split is a good tool for that job. But ... this variable length field you speak of throws a monkey wrench in the gears. Here is a regex that should do the trick:

    use strict;
    use warnings;
    use Data::Dumper;

    my $str = 'May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527';

    # no need to use $1 $2 etc. when you can name them ;)
    my ($mon,$day,$time,$app,$id,$msg,$proto,@ip) = $str =~ /
                          # example match
       (\S+)\s+           # May
       (\S+)\s+           # 16
       (\S+)\s+           # 11:17:12
       ([^:]+):\s+        # system app
       (\S+)\s+           # [id:id:id]
       ([^\{]+)           # Message [Class] [Priority]
       (\{[^\}]+\})\s+    # {TCP}
       (\S+)              # 111.1.1.64:1863
       \s+\-\>\s+         # (skip ->)
       (\S+)              # 222.2.2.155:55527
    /x;

    print "$mon, $day, $time, $app, $id, $msg, $proto, @ip\n";

The important concept to take away from this is negation: `(\S+)` matches anything but whitespace, `([^:]+)` matches anything up to a colon. I stored the two IP addresses in an array, your mileage will vary.

Update: any complex regex such as this one should use the `x` modifier so it can be segmented and commented.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: (jeffa) Re: Variable Length Parsing by ACiD (Novice) on Jun 08, 2003 at 21:51 UTC
All, thank you kindly. I tried several of your tactics and got it to break off at the `{TCP}` (still have more work, though). I did not know you could use `^blah` in mid stream. I thought it was "beginning of line" (opposite of $). I assume that since I () paren'd it, it WAS the beginning of the next undefined $4 (or whatever number)? (Again, forgive the ignorance with respect to the code and language.)
(jeffa) 3Re: Variable Length Parsing by jeffa (Bishop) on Jun 09, 2003 at 00:58 UTC
Yes, ^ can be used to negate a character class as well as anchor a match - it was not the parens that changed context, it was the square brackets (which denote a character class). Check out the docs, in particular, the sentence below xdigit.
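To make the two meanings of `^` concrete, here is a minimal sketch. The sample string is just a fragment of the line from the question, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = 'system app: [id:id:id] Message';

# Outside square brackets, ^ anchors the match to the start of the string.
my ($first_word) = $line =~ /^(\w+)/;

# As the first character inside square brackets, ^ negates the class:
# [^:]+ means "one or more characters that are not a colon".
my ($before_colon) = $line =~ /([^:]+):/;

print "$first_word\n";     # system
print "$before_colon\n";   # system app
```

The parentheses only decide what gets captured; the brackets are what turn `^` into "not".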
Re: Variable Length Parsing by tcf22 (Priest) on Jun 08, 2003 at 20:40 UTC
Well, at the end you have `(.+)`, and by default Perl will match as much as it possibly can (i.e. the rest of the string). I believe what you want is to match up until {TCP}, so try putting `{TCP}` after the `(.+)`.
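A rough sketch of that idea, using the sample line from the question. The braces are escaped here to be safe, since a bare `{` can be taken as the start of a quantifier, and the literal `[id:id:id]` is only the placeholder from the post, not a real Snort id.

```perl
use strict;
use warnings;

my $line = 'May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] '
         . '{TCP} 111.1.1.64:1863 -> 222.2.2.155:55527';

# (.+) is still greedy, but the literal \{TCP\} after it forces the
# engine to backtrack until the protocol marker can match.
my ($msg, $rest) = $line =~ /\[id:id:id\]\s+(.+)\s+\{TCP\}\s+(.+)/;

print "$msg\n";    # Message [Class] [Priority:]
print "$rest\n";   # 111.1.1.64:1863 -> 222.2.2.155:55527
```

This only handles lines that contain `{TCP}`; other protocols need an alternation in place of the fixed literal.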
Re: Variable Length Parsing by CombatSquirrel (Hermit) on Jun 08, 2003 at 20:56 UTC
Not knowing which characters in your example are meta characters (for example, I can't find the 'snort' in the example; maybe you could give some actual lines?), I would suppose that the regex is not quite doing what you think it does. The part `[\[(\d+\:\d+\:\d+)\]]` is actually equivalent to `[\[(\d+:)\]]`. It is a character class of "]", "[", "(", ")", ":", "+" and any digit, which matches zero or more times (greedy matching). It will consume any run of those characters in any order without capturing anything, so the parentheses inside it do not give you an id; the regex then matches a few more characters until `(.+)` slurps up the whole rest. Although I don't know the format, I would recommend changing the mentioned part to `(?:\[(\d+\:\d+\:\d+)\])*`, which is probably more along the lines of what you intended to do. Note that `[]` stands for a character class, whereas `(?:)` are non-capturing parens. Hope that helped.
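The difference between the character class and a grouped pattern can be checked with a small sketch. The id `[1:469:3]` is a hypothetical value, and the grouped pattern is one reading of what was intended, not a confirmed fix for the real log format.

```perl
use strict;
use warnings;

# The bracketed part from the question: one character class, repeated.
my $class = qr/[\[(\d+\:\d+\:\d+)\]]*/;

# A grouped form: literal brackets around a captured n:n:n triple.
my $group = qr/(?:\[(\d+\:\d+\:\d+)\])*/;

# The class accepts any jumble of its member characters.
my $class_ok = ')(+' =~ /\A$class\z/ ? 1 : 0;

# The group rejects the jumble and captures the triple from a real tag.
my $group_ok = ')(+' =~ /\A$group\z/ ? 1 : 0;
my ($id)     = '[1:469:3]' =~ /\A$group\z/;

print "$class_ok $group_ok $id\n";   # 1 0 1:469:3
```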
Re: Variable Length Parsing by TomDLux (Vicar) on Jun 08, 2003 at 20:40 UTC
You're trying to break this into words, right? Why not use split()? `my ( @words ) = split /\s/, $line;`
Re^2: Variable Length Parsing by Aristotle (Chancellor) on Jun 08, 2003 at 23:49 UTC
That will find you an empty string between each of a series of consecutive whitespace characters. You probably want `\s+` unless you're really splitting fields separated by exactly one blank each.

Makeshifts last the longest.
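A quick way to see the difference, as a sketch. The line below is a hypothetical one with a two-space gap, which can happen when a single-digit day is padded.

```perl
use strict;
use warnings;

my $line = 'May  6 11:17:12 system';   # note the two spaces after "May"

my @single = split /\s/,  $line;   # one separator per whitespace character
my @run    = split /\s+/, $line;   # a run of whitespace is one separator

print scalar(@single), "\n";   # 5 (includes one empty field)
print scalar(@run), "\n";      # 4
```

Perl also has the special form `split ' ', $line`, which splits on runs of whitespace and additionally skips leading whitespace.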
Re: Variable Length Parsing by pzbagel (Chaplain) on Jun 08, 2003 at 20:44 UTC
Two things you should do. First, switch `.+` to its non-greedy version (`.+?`). Second, if you know the variable length field will be terminated with TCP/UDP/ICMP/ERROR, then create an alternation that states that in your regex to ensure that $5 doesn't suck up everything till the end of the line: `/(TCP|UDP|ICMP|IP|ERROR)/` HTH
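Putting both suggestions together might look like the sketch below. The second sample line is made up to show the ICMP case, the hash keys are illustrative names, and the pattern assumes the protocol always appears in braces as in the question.

```perl
use strict;
use warnings;

# First line is from the question; the ICMP line is a made-up variant.
my @lines = (
    'May 16 11:17:12 system app: [id:id:id] Message [Class] [Priority:] {TCP} 111.1.1.64:1863 -> 222.2.2.155:55527',
    'May 16 11:17:13 system app: [id:id:id] Other Message [Class] [Priority:] {ICMP} 111.1.1.64 -> 222.2.2.155',
);

my @parsed;
for my $line (@lines) {
    if ($line =~ /
            \] \s+                            # end of [id:id:id]
            (.+?) \s+                         # variable length field, non-greedy
            \{ (TCP|UDP|ICMP|IP|ERROR) \}     # protocol alternation ends the field
            \s+ (\S+) \s+ -> \s+ (\S+)        # source -> destination
        /x) {
        push @parsed, { msg => $1, proto => $2, src => $3, dst => $4 };
    }
}

printf "%-5s %s\n", $_->{proto}, $_->{msg} for @parsed;
```

Because the alternation is captured, the later if statement on the protocol can simply test the `proto` value.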
Re: Variable Length Parsing by tos (Deacon) on Jun 08, 2003 at 20:48 UTC
Hi,

probably an alternation like this

    # perl -we '$x="somestring 12345 {ICMP}";$x=~/(.+({TCP}|{ICMP}))/ && print "$1\n"'
    somestring 12345 {ICMP}

could help you.