regex issues with /gc in log analysis...

EvanK has asked for the wisdom of the Perl Monks concerning the following question:

i'm writing a log parser, and am trying to optimize the way i match valid entries in the logfile. its a logfile for a multiplayer fps (F.E.A.R.), and its not as simple as most logs ive worked with, because the data i want (info for a given player after each round) spans multiple lines. this is a sample log for a single game:

[Thu Sep 21 17:48:38 2006] 
[Thu Sep 21 17:48:38 2006] ------------------------------------------
[Thu Sep 21 17:48:38 2006] Server started.
[Thu Sep 21 18:37:22 2006] Client connected: Alpha
[Thu Sep 21 18:37:31 2006] Client connected: Bravo
[Thu Sep 21 18:38:20 2006] Client connected: Charlie
[Thu Sep 21 18:39:18 2006] Client connected: Delta
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] *** Results for Map: Worlds\ReleaseMultipla
+yer\Bypass
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Team: Team 1
[Thu Sep 21 18:53:36 2006] Score: 93
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Player: Alpha (uid: ad7023b7f46271acd31e1bd
+287613b6d)
[Thu Sep 21 18:53:36 2006] Score: 55
[Thu Sep 21 18:53:36 2006] Kills: 14
[Thu Sep 21 18:53:36 2006] Deaths: 15
[Thu Sep 21 18:53:36 2006] Team Kills: 0
[Thu Sep 21 18:53:36 2006] Suicides: 0
[Thu Sep 21 18:53:36 2006] Objective: 0
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Player: Bravo (uid: 5fdcc95043dc4dac9d7b4af
+b8469eb4f)
[Thu Sep 21 18:53:36 2006] Score: 38
[Thu Sep 21 18:53:36 2006] Kills: 11
[Thu Sep 21 18:53:36 2006] Deaths: 17
[Thu Sep 21 18:53:36 2006] Team Kills: 0
[Thu Sep 21 18:53:36 2006] Suicides: 0
[Thu Sep 21 18:53:36 2006] Objective: 0
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Team: Team 2
[Thu Sep 21 18:53:36 2006] Score: 135
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Player: Charlie (uid: e94839cae76debf1418ab
+9dfaa4c01e8)
[Thu Sep 21 18:53:36 2006] Score: 61
[Thu Sep 21 18:53:36 2006] Kills: 15
[Thu Sep 21 18:53:36 2006] Deaths: 14
[Thu Sep 21 18:53:36 2006] Team Kills: 0
[Thu Sep 21 18:53:36 2006] Suicides: 0
[Thu Sep 21 18:53:36 2006] Objective: 0
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] Player: Delta (uid: b2ea959c1b3fa5c35ef6a6e
+576cdf2af)
[Thu Sep 21 18:53:36 2006] Score: 46
[Thu Sep 21 18:53:36 2006] Kills: 10
[Thu Sep 21 18:53:36 2006] Deaths: 4
[Thu Sep 21 18:53:36 2006] Team Kills: 0
[Thu Sep 21 18:53:36 2006] Suicides: 0
[Thu Sep 21 18:53:36 2006] Objective: 0
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:53:36 2006] 
[Thu Sep 21 18:57:47 2006] Client disconnected: Delta
[Thu Sep 21 18:58:01 2006] Client disconnected: Alpha
[Thu Sep 21 18:58:17 2006] Client disconnected: Bravo
[Thu Sep 21 18:59:03 2006] Client disconnected: Charlie
[download]

big and unwieldy, i know, and this is just one round. i have a real logfile that spans about a month, its huuuuuge. now, i have a working parser for it, using a search & replace regex that 'nibbles' away at the log:

note: since the server may be windows or linux, i'm explicitly matching both LF and CRLF newlines, hence the \x0D?\x0A at the end of the line


# loop through, 'nibbling' at logfile with every
# successful match & replacement
while(
  $log =~ s{
  \[ ([^\]]+?) \] \s+ Player\: \s+ ([^\x20]+?) \s+ \( uid: \s+ (\w+) \
+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Score: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Kills: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Deaths: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Team \s+ Kills: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Suicides: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Objective: \s+ (\d+)  \x0D?\x0A
  }{}ixms
) {

  # grab values from regex backreferences
  $timestamp = $1;
  $player    = $2;
  $uid       = $3;
  $score     = $4;
  $kills     = $5;
  $deaths    = $6;
  $teamkills = $7;
  $suicides  = $8;
  $objective = $9;
  
  # ...process log entry...
}
[download]

this works, its just slow and rather inefficient, not to mention it will likely be a beast to maintain.

i was trying to change to using a regular match with a /gc flag because it would be much more efficient, and it would also allow me to bypass the backreferences altogether:


# loop through, grabbing values for every
# successful match with the /gc flag
while(
  ($timestamp,$player,$uid,$score,$kills,$deaths,$teamkills,$suicides,
+$objective) = 
  $log =~ m{ \G .+
  \[ ([^\]]+?) \] \s+ Player\: \s+ ([^\x20]+?) \s+ \( uid: \s+ (\w+) \
+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Score: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Kills: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Deaths: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Team \s+ Kills: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Suicides: \s+ (\d+)  \x0D?\x0A
  \[ [^\]]+? \] \s+ Objective: \s+ (\d+)  \x0D?\x0A
  }ixmsgc
) {
  
  # ...process log entry...
}
[download]

per my understanding, it should match \G as the beginning of the string on the first time through, then match it as the end of the previous match from then on...however, it only ever matches one time.

am i missing something blatantly obvious? or is there even a cleaner approach that i havent thought of?

__________
Build a man a fire, and he'll be warm for a day. Set a man on fire, and he'll be warm for the rest of his life.
- Terry Pratchett

Comment on regex issues with /gc in log analysis... Select or Download Code

Replies are listed 'Best First'.
Re: regex issues with /gc in log analysis... by Errto (Vicar) on Oct 27, 2006 at 01:14 UTC
It looks like the problem is this part of your pattern: `.+`. This, being greedy, will eat up everything up to the last possible occurrance of the rest of your pattern. You actually shouldn't need either the \G anchor or that `.+` part because the /g flag means that each time through the loop it will pick up where it left off after the last match. Update: Also, you should be able to use plain old \n instead of that funky `\x0D?\x0A` so long as the log is generated on the same OS you're running the Perl program on. If not, just run `dos2unix` or `unix2dos` as appropriate.	[reply] [d/l] [select]
Re: regex issues with /gc in log analysis... by graff (Chancellor) on Oct 27, 2006 at 03:09 UTC
Your regexes make it look like you are reading large portions of the log at one time (or maybe even slurping it?). Why? If the log format is reliably consistent with the example you gave, a simple line-by-line reading should be plenty easy, quick and maintainable; and it looks like the kind of situation where a hash data structure would be handy: #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @fieldnames = ( 'Team', 'Score', 'Player', 'Kills', 'Deaths', 'Team Kills', 'Suicides', 'Objective', 'Results for +Map' ); my $matchfield = join '\|', @fieldnames; my %entry; my ( $map, $team, $player, $datafor ); while (<DATA>) { s/[\r\n]+//; # OS-neutral alternative to "chomp" my ( $timestamp, $data ) = unpack( "A26 A", $_ ); next unless ( length( $data ) and $data =~ /$matchfield/ ); $timestamp =~ tr/[]//d; my ( $fieldname, $fieldval ) = split( /: /, $data, 2 ); $fieldname =~ s/^[\s]+//; if ( $fieldname eq 'Results for Map' ) { processEntry( \%entry ); %entry = (); $entry{map} = $fieldval; $entry{time} = $timestamp; } elsif ( $fieldname eq 'Team' ) { $datafor = $fieldname; $team = $fieldval; $entry{$team} = {}; } elsif ( $fieldname eq 'Player' ) { $datafor = $fieldname; my $uid; ( $player, $uid ) = ( $fieldval =~ /(\w+) .uid: (\w+)/ ); $entry{$team}{$player}{uid} = $uid; } elsif ( $fieldname eq 'Score' ) { if ( $datafor eq 'Team' ) { $entry{$team}{Score} = $fieldval; } else { $entry{$team}{$player}{Score} = $fieldval; } } else { $entry{$team}{$player}{$fieldname} = $fieldval; } } processEntry( \%entry ); sub processEntry { my $href = shift; return unless ( scalar keys %$href ); print Dumper $href; } __DATA__ [Thu Sep 21 17:48:38 2006] [Thu Sep 21 17:48:38 2006] ------------------------------------------ [Thu Sep 21 17:48:38 2006] Server started. [Thu Sep 21 18:37:22 2006] Client connected: Alpha [Thu Sep 21 18:37:31 2006] Client connected: Bravo [Thu Sep 21 18:38:20 2006] Client connected: Charlie [Thu Sep 21 18:39:18 2006] Client connected: Delta [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] *** Results for Map: Worlds\ReleaseMultipla +yer\Bypass [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Team: Team 1 [Thu Sep 21 18:53:36 2006] Score: 93 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Alpha (uid: ad7023b7f46271acd31e1bd +287613b6d) [Thu Sep 21 18:53:36 2006] Score: 55 [Thu Sep 21 18:53:36 2006] Kills: 14 [Thu Sep 21 18:53:36 2006] Deaths: 15 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Bravo (uid: 5fdcc95043dc4dac9d7b4af +b8469eb4f) [Thu Sep 21 18:53:36 2006] Score: 38 [Thu Sep 21 18:53:36 2006] Kills: 11 [Thu Sep 21 18:53:36 2006] Deaths: 17 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [download]	[reply] [d/l]
Re: regex issues with /gc in log analysis... by bobf (Monsignor) on Oct 27, 2006 at 03:23 UTC
If the format is as consistent as it appears to be, you could process each line at a time. This approach might be significantly faster since it uses simpler patterns rather than one long, complicated one (I haven't benchmarked it), but IMO the real win is in how much easier it would be to maintain. use strict; use warnings; my %playerdata; while( my $line = <DATA> ) { if( $line =~ m/\[(.+?)] Player: (.+?) $uid: (.+?)$/ ) { $playerdata{timestamp} = $1; $playerdata{player} = $2; $playerdata{uid} = $3; } foreach my $stat ( qw( Score Kills Deaths Suicides ), 'Team Kills' + ) { if( $line =~ m/] $stat: (\d+)/ ) { $playerdata{$stat} = $1; last; } } if( $line =~ m/Objective: (\d+)/ ) { $playerdata{Objective} = $1; # ...do something with log entry... %playerdata = (); } } __DATA__ [Thu Sep 21 17:48:38 2006] [Thu Sep 21 17:48:38 2006] ------------------------------------------ [Thu Sep 21 17:48:38 2006] Server started. [Thu Sep 21 18:37:22 2006] Client connected: Alpha [Thu Sep 21 18:37:31 2006] Client connected: Bravo [Thu Sep 21 18:38:20 2006] Client connected: Charlie [Thu Sep 21 18:39:18 2006] Client connected: Delta [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] *** Results for Map: Worlds\ReleaseMultipla +yer\Bypass [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Team: Team 1 [Thu Sep 21 18:53:36 2006] Score: 93 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Alpha (uid: ad7023b7f46271acd31e1bd +287613b6d) [Thu Sep 21 18:53:36 2006] Score: 55 [Thu Sep 21 18:53:36 2006] Kills: 14 [Thu Sep 21 18:53:36 2006] Deaths: 15 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Bravo (uid: 5fdcc95043dc4dac9d7b4af +b8469eb4f) [Thu Sep 21 18:53:36 2006] Score: 38 [Thu Sep 21 18:53:36 2006] Kills: 11 [Thu Sep 21 18:53:36 2006] Deaths: 17 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Team: Team 2 [Thu Sep 21 18:53:36 2006] Score: 135 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Charlie (uid: e94839cae76debf1418ab +9dfaa4c01e8) [Thu Sep 21 18:53:36 2006] Score: 61 [Thu Sep 21 18:53:36 2006] Kills: 15 [Thu Sep 21 18:53:36 2006] Deaths: 14 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] Player: Delta (uid: b2ea959c1b3fa5c35ef6a6e +576cdf2af) [Thu Sep 21 18:53:36 2006] Score: 46 [Thu Sep 21 18:53:36 2006] Kills: 10 [Thu Sep 21 18:53:36 2006] Deaths: 4 [Thu Sep 21 18:53:36 2006] Team Kills: 0 [Thu Sep 21 18:53:36 2006] Suicides: 0 [Thu Sep 21 18:53:36 2006] Objective: 0 [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:53:36 2006] [Thu Sep 21 18:57:47 2006] Client disconnected: Delta [Thu Sep 21 18:58:01 2006] Client disconnected: Alpha [Thu Sep 21 18:58:17 2006] Client disconnected: Bravo [Thu Sep 21 18:59:03 2006] Client disconnected: Charlie [download] You could add a number of optimizations to this (like cleaning up the regexes: `(.+?)`, etc), but this should give you the idea. You could also add code to grab the fields in between the player blocks (for the team stats). You could also split the line into <timestamp> and <the rest> and process each independently. Finally, you could try to use paragraph mode ($/) to read in a whole record at a time, but I'm not sure that would help WRT maintenance. Update: looks like graff beat me to it. :-)	[reply] [d/l] [select]
Re: regex issues with /gc in log analysis... by EvanK (Chaplain) on Oct 28, 2006 at 00:59 UTC
wow, thanks! Errto was right, it was the greedy `.+` that was messing me up in the first place, but i've dropped that monster of a regex in favor of the line-by-line approach suggested by graff and bobf...it works like a charm, and much faster, too. i'm building an OO interface around this parser, so it will be up on the CPAN as a module once i finish. naturally, i'll note all the help i got here. and if there are any gamers here, i belong to a clan (NAEA) that has our own server, feel free to come join us. FEAR multiplayer is free :) __________ Build a man a fire, and he'll be warm for a day. Set a man on fire, and he'll be warm for the rest of his life. - Terry Pratchett	[reply] [d/l]