comment on

I've been reading this thread and the questions from BrowserUk. I probably don't completely understand the issues here, but I will interject a few thoughts that might help spur other thoughts...

I think that you want separate lines based upon the node and the audit information?
From what I understand, it could be that a node will spew out interleaved audit info representing 2 different events although your example data does not show that?

I was looking at your code and my first thought was, "hey, run the regex engine once!" instead of multiple times. The regex engine is wonderful, but "expensive" CPU wise when used in high volume parsing applications.

This code:

if (/^node=(\S+).*audit\((\d+\....):(\d+)\)/){}
if (/^node=(\S+) type=EOE msg=audit\((\d+\....):(\d+)\)/){}
[download]

invokes the regex engine twice. I wrote some code to do a regex only once, but deleted it because...

I do not see the need to use the regex engine at all! You have a very regularly formatted log file and the much more "light-weight" split on a single character can be used.

I hope my attached code will spur you on to modify your code and do some more benchmarks.

For the "time_out" code, I would only run that once per X minutes. Cycle through all entries in the %time hash to find anybody who hasn't been updated in the last 5 minutes as per your clock - no need to decode the "audit" time information. This is like a "garbage collection" process that happens occasionally (although that analogy doesn't apply to Perl itself).

This code produces a slightly different output than yours does. But you are apparently defining a new format and that might be ok?

I am very skeptical that multi-threads or multi-processes can help in this situation. At the end of the day, you are writing to a single disk file. There can be a lot of overhead in flushing buffers between CPU thread/process context switches. An attempt to make this mult-threaded may actually run much slower. Run benchmarks on efficient single threaded program as a first step before getting into a much more complicated execution scenario.

#!/usr/bin/perl
use strict;
use warnings;

my %time_out;  # last local time stamp for each %data key
my %data;

my $next_cleanup_due;  #a local time = 5 minutes from last cleanup

while (<DATA>)
{
    chomp;
    
    next unless (index($_,"=")>0); # skip blank or malformatted lines
                                   # must have at least one "=" sign
    
    my $current_time=time(); #  "epoch" time.
    
    # note: "nodef" means "node field", not a parsed node=aaaaaaaaaa
    # note: the use of 3rd field in the split() to limit the fields.
    
    my ($nodef, $typef, $auditf, $textf) = split(' ', $_, 4);
    $textf //= "";              # type=EOE has no logged text field
      
    push @{$data{"$nodef $auditf"}}, "$typef $textf";
    $time_out{"$nodef $auditf"}= $current_time; #local epoch time
    
    # use index() instead of regex for EOE detection
    
    dump_data ("$nodef $auditf") if (index($typef,"EOE")>0);
 
 ## see my posted text... re: time_out strategy and methods
 ## if ($current_time > $next_cleanup_due) #only every 5 minutes
 ## {
 ##    run through all key,values of %time and dump_data()
 ##    for all %time values that have "expired"...
 ##    probably need to add special kind of EOE message
 ## }
}

sub dump_data
{
    my $node_audit= shift;
    
    print "$node_audit ", join(" ",@{$data{$node_audit}}), "\n";
    
    #clean up hashes:
    
    delete $data{$node_audit};
    delete $time_out{$node_audit};
} 

=prints
node=aaaaaaaaaa msg=audit(1485583203.459:5485148): type=SYSCALL arch=c
+000003e syscall=59 success=no exit=-2 a0=7f30b9d87149 a1=7f30b9d86860
+ a2=7f30b9d86bd8 a3=7f30b9d9c8c0 items=1 ppid=xxxxx pid=xxxxx auid=xx
+xxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx 
+sgid=xxxxx fsgid=xxxxx tty=(none) ses=16439 comm="command" exe="/bin/
+ksh93" key="cmdlineExecution" type=CWD cwd="/a/cwd" type=PATH item=0 
+name="/etc/uname" nametype=UNKNOWN type=EOE 
node=xxxxxxxxxx msg=audit(1485583201.776:5485082): type=SYSCALL arch=c
+000003e syscall=82 per=400000 success=yes exit=0 a0=7fc164006990 a1=7
+fc164006b70 a2=7fc164006b70 a3=7fc230853278 items=4 ppid=xxxxx pid=xx
+xxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx 
+egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=4294967295 comm="som
+ecommand" exe="/full/path/to/somecommand" key="delete" type=CWD cwd="
+/another/cwd" type=PATH item=0 name="arg-data-0" inode=268805 dev=fd:
+14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT type=
+PATH item=1 name="arg-data-1" inode=268805 dev=fd:14 mode=040740 ouid
+=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT type=PATH item=2 name="a
+rg-data-2" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx 
+rdev=00:00 nametype=DELETE type=PATH item=3 name="arg-data-3" inode=2
+69256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametyp
+e=CREATE type=EOE 
=cut

  
__DATA__
# this bad line here on purpose
node=xxxxxxxxxx type=SYSCALL msg=audit(1485583201.776:5485082): arch=c
+000003e syscall=82 per=400000 success=yes exit=0 a0=7fc164006990 a1=7
+fc164006b70 a2=7fc164006b70 a3=7fc230853278 items=4 ppid=xxxxx pid=xx
+xxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx 
+egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=4294967295 comm="som
+ecommand" exe="/full/path/to/somecommand" key="delete"
node=xxxxxxxxxx type=CWD msg=audit(1485583201.776:5485082):  cwd="/ano
+ther/cwd"
node=aaaaaaaaaa type=SYSCALL msg=audit(1485583203.459:5485148): arch=c
+000003e syscall=59 success=no exit=-2 a0=7f30b9d87149 a1=7f30b9d86860
+ a2=7f30b9d86bd8 a3=7f30b9d9c8c0 items=1 ppid=xxxxx pid=xxxxx auid=xx
+xxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx 
+sgid=xxxxx fsgid=xxxxx tty=(none) ses=16439 comm="command" exe="/bin/
+ksh93" key="cmdlineExecution"
node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=0 na
+me="arg-data-0" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xx
+xxx rdev=00:00 nametype=PARENT
node=aaaaaaaaaa type=CWD msg=audit(1485583203.459:5485148):  cwd="/a/c
+wd"
node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=1 na
+me="arg-data-1" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xx
+xxx rdev=00:00 nametype=PARENT
node=aaaaaaaaaa type=PATH msg=audit(1485583203.459:5485148): item=0 na
+me="/etc/uname" nametype=UNKNOWN
node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=2 na
+me="arg-data-2" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=x
+xxxx rdev=00:00 nametype=DELETE
node=aaaaaaaaaa type=EOE msg=audit(1485583203.459:5485148):
node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=3 na
+me="arg-data-3" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=x
+xxxx rdev=00:00 nametype=CREATE
node=xxxxxxxxxx type=EOE msg=audit(1485583201.776:5485082):
[download]

Update:

An attempt clarify the timeout code...
Two important parms:

$next_cleanup_due, the next time() that "cleanup()" will run
$time_out, an individual "$node $audit" time_out

Basically, run the hash cleanup() code every so often. When that code runs, any entries that are "timed out" are output and internal structures updated. These variables are independent. Could be cleanup hash runs once per minute and individual time out is 5 minutes. Could be that the cleanup hash routine runs once per hour while individual time out is 5 minutes.

Something like this (untested):

## see my posted text...
 ## if ($current_time > $next_cleanup_due) #only every 5+ minutes
 ## {
 ##    run through all key,values of %time and dump_data()
 ##    for all %time values that have "expired"...
 ##    probably need to add special kind of EOE message
 ## }
 

cleanup() if ($current_time > $next_cleanup_due); 

sub cleanup
{
   $next_cleanup_due += 6*60; # 6 minutes from now

   foreach my $node_audit (keys %time_out)
   {
       if ($current_time > ($time_out{$node_audit} + $time_out))
       {
           push @{$data{$node_audit}}, "type=EXPIRED";
           dump_data ($node_audit);
       }
   }
}
[download]

I would expect that the code runs really fast until say once per minute or 5 minutes, it slows down to look at all it is working on and cleans up expired stuff. This "time out stuff" is relatively "expensive" CPU wise. However, I recommend that you benchmark something similar to my suggestions and see what happens.
Performance keys:

Minimize use of regex engine
use the most simple data structures possible - multi-dimensional hash can get "expensive"
don't look at the whole hash of "work in progress" more often than you need to.

In reply to Re: Multi-CPU when reading STDIN and small tasks by Marshall
in thread Multi-CPU when reading STDIN and small tasks by bspencer

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.