bspencer has asked for the wisdom of the Perl Monks concerning the following question:

Background: A central syslog server receives syslog messages from many servers and filters messages of a certain type to a Perl script, which in turn reads STDIN. The task of the script is to flatten the multi-line messages into single lines and write them out to disk. Because related lines may not arrive consecutively, I use a hash index to relate all of the associated lines. Some lines have an "end of event" marker which is used to trigger the write, while others may not, and I time those out after 5 seconds.

Issue: Currently the script is single threaded, and it's anticipated that this will be the first limit reached, since rsyslog is multithreaded. While I've read a bit about threading in Perl that could help, the examples seem to be directed towards workloads that do substantial work per task, whereas this script mostly spends its time on very small tasks: 1) looking for a couple of patterns in a single stream; 2) writing data out; 3) occasionally stepping through a couple of loops writing out expired data.

Question: Given the nature of this script, would threading help, or do things such as the single input stream and the need to relate the data before writing stand in the way?

The main code loop:

while(<>){
    chop;
    if (/^node=(\S+).*audit\((\d+\....):(\d+)\)/){
        if (! $time{$2}{$1}{$3}){
            $time{$2}{$1}{$3}=1;
        }
        if (/^node=(\S+) type=EOE msg=audit\((\d+\....):(\d+)\)/){
            print_data($1,$2,$3);
            $totalevents++;
        }else{
            push(@{$data{$1}{"$2:$3"}},$_);
        }
    }
    $cnt++;
    if ($cnt > $agecheck){
        # see if entries have aged off and should be written out
        $date=&dateonly;
        while (my ($t)=each(%time)){
            if ($t < (time() - $age)){
                foreach my $host (keys(%{$time{$t}})){
                    foreach my $event (keys(%{$time{$t}{$host}})){
                        logit("Aged: node=$host $t:$event");
                        print_data($host,$t,$event);
                        update_stats();
                    }
                }
            }
        }
        $cnt=0;
    }
    $totallines++;
}

sub print_data {
    ...
    # Dedup the data based on the data in the string
    # Parent: http://www.perlmonks.org/bare/?node_id=104565
    # specific post: http://www.perlmonks.org/bare/?node_id=104602
    my $singleline=join(" ",@{$data{$host}{"$time:$event"}});
    $databefore+=length($singleline);
    $singleline=~s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg;
    $dataafter+=length($singleline);
    print ${$fh} "$singleline\n";
    ...
}
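For illustration, here is a minimal standalone run of that dedup substitution (the input line is invented for the demo); each whitespace-delimited token is kept only at its first occurrence:

#!/usr/bin/perl
use strict;
use warnings;

# invented sample with repeated fields
my $singleline = 'type=PATH item=0 dev=fd:14 type=PATH item=1 dev=fd:14';
my %count;

# keep the first occurrence of each token, drop later repeats
$singleline =~ s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg;

print "$singleline\n";   # prints: type=PATH item=0 dev=fd:14 item=1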

Replies are listed 'Best First'.
Re: Multi-CPU when reading STDIN and small tasks
by Marshall (Canon) on Jan 30, 2017 at 06:38 UTC
    I've been reading this thread and the questions from BrowserUk. I probably don't completely understand the issues here, but I will interject a few thoughts that might help spur other thoughts...

    I think that you want separate lines based upon the node and the audit information?
    From what I understand, it could be that a node will spew out interleaved audit info representing 2 different events although your example data does not show that?

    I was looking at your code and my first thought was, "hey, run the regex engine once!" instead of multiple times. The regex engine is wonderful, but "expensive" CPU wise when used in high volume parsing applications.

    This code:

    if (/^node=(\S+).*audit\((\d+\....):(\d+)\)/){}
    if (/^node=(\S+) type=EOE msg=audit\((\d+\....):(\d+)\)/){}
    invokes the regex engine twice. I wrote some code to do a regex only once, but deleted it because...

    I do not see the need to use the regex engine at all! You have a very regularly formatted log file and the much more "light-weight" split on a single character can be used.

    I hope my attached code will spur you on to modify your code and do some more benchmarks.

    For the "time_out" code, I would only run that once per X minutes. Cycle through all entries in the %time hash to find anybody who hasn't been updated in the last 5 minutes as per your clock - no need to decode the "audit" time information. This is like a "garbage collection" process that happens occasionally (although that analogy doesn't apply to Perl itself).

    This code produces a slightly different output than yours does. But you are apparently defining a new format and that might be ok?

    I am very skeptical that multi-threads or multi-processes can help in this situation. At the end of the day, you are writing to a single disk file. There can be a lot of overhead in flushing buffers between CPU thread/process context switches. An attempt to make this multi-threaded may actually run much slower. Run benchmarks on an efficient single-threaded program as a first step before getting into a much more complicated execution scenario.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %time_out;          # last local time stamp for each %data key
    my %data;
    my $next_cleanup_due;  # a local time = 5 minutes from last cleanup

    while (<DATA>)
    {
        chomp;
        next unless (index($_,"=")>0); # skip blank or malformatted lines -
                                       # must have at least one "=" sign
        my $current_time = time();     # "epoch" time

        # note: "nodef" means "node field", not a parsed node=aaaaaaaaaa
        # note: the 3rd argument to split() limits the number of fields
        my ($nodef, $typef, $auditf, $textf) = split(' ', $_, 4);
        $textf //= "";                 # type=EOE has no logged text field

        push @{$data{"$nodef $auditf"}}, "$typef $textf";
        $time_out{"$nodef $auditf"} = $current_time;  # local epoch time

        # use index() instead of regex for EOE detection
        dump_data("$nodef $auditf") if (index($typef,"EOE")>0);

        ## see my posted text... re: time_out strategy and methods
        ## if ($current_time > $next_cleanup_due) # only every 5 minutes
        ## {
        ##     run through all key,values of %time_out and dump_data()
        ##     for all %time_out values that have "expired"...
        ##     probably need to add special kind of EOE message
        ## }
    }

    sub dump_data
    {
        my $node_audit = shift;
        print "$node_audit ", join(" ", @{$data{$node_audit}}), "\n";
        # clean up hashes:
        delete $data{$node_audit};
        delete $time_out{$node_audit};
    }
    =prints
    node=aaaaaaaaaa msg=audit(1485583203.459:5485148): type=SYSCALL arch=c000003e syscall=59 success=no exit=-2 a0=7f30b9d87149 a1=7f30b9d86860 a2=7f30b9d86bd8 a3=7f30b9d9c8c0 items=1 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=16439 comm="command" exe="/bin/ksh93" key="cmdlineExecution" type=CWD cwd="/a/cwd" type=PATH item=0 name="/etc/uname" nametype=UNKNOWN type=EOE
    node=xxxxxxxxxx msg=audit(1485583201.776:5485082): type=SYSCALL arch=c000003e syscall=82 per=400000 success=yes exit=0 a0=7fc164006990 a1=7fc164006b70 a2=7fc164006b70 a3=7fc230853278 items=4 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=4294967295 comm="somecommand" exe="/full/path/to/somecommand" key="delete" type=CWD cwd="/another/cwd" type=PATH item=0 name="arg-data-0" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT type=PATH item=1 name="arg-data-1" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT type=PATH item=2 name="arg-data-2" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=DELETE type=PATH item=3 name="arg-data-3" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=CREATE type=EOE
    =cut
    __DATA__
    # this bad line here on purpose
    node=xxxxxxxxxx type=SYSCALL msg=audit(1485583201.776:5485082): arch=c000003e syscall=82 per=400000 success=yes exit=0 a0=7fc164006990 a1=7fc164006b70 a2=7fc164006b70 a3=7fc230853278 items=4 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=4294967295 comm="somecommand" exe="/full/path/to/somecommand" key="delete"
    node=xxxxxxxxxx type=CWD msg=audit(1485583201.776:5485082): cwd="/another/cwd"
    node=aaaaaaaaaa type=SYSCALL msg=audit(1485583203.459:5485148): arch=c000003e syscall=59 success=no exit=-2 a0=7f30b9d87149 a1=7f30b9d86860 a2=7f30b9d86bd8 a3=7f30b9d9c8c0 items=1 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=16439 comm="command" exe="/bin/ksh93" key="cmdlineExecution"
    node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=0 name="arg-data-0" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT
    node=aaaaaaaaaa type=CWD msg=audit(1485583203.459:5485148): cwd="/a/cwd"
    node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=1 name="arg-data-1" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT
    node=aaaaaaaaaa type=PATH msg=audit(1485583203.459:5485148): item=0 name="/etc/uname" nametype=UNKNOWN
    node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=2 name="arg-data-2" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=DELETE
    node=aaaaaaaaaa type=EOE msg=audit(1485583203.459:5485148):
    node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=3 name="arg-data-3" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=CREATE
    node=xxxxxxxxxx type=EOE msg=audit(1485583201.776:5485082):
    Update:

    An attempt to clarify the timeout code...
    Two important parameters:

    1. $next_cleanup_due, the next time() that "cleanup()" will run
    2. $time_out, an individual "$node $audit" time_out
    Basically, run the hash cleanup() code every so often. When that code runs, any entries that have "timed out" are output and the internal structures updated. These two variables are independent: the cleanup could run once per minute with an individual time-out of 5 minutes, or the cleanup routine could run once per hour while the individual time-out is still 5 minutes.

    Something like this (untested):

    ## see my posted text...
    ## if ($current_time > $next_cleanup_due) # only every 5+ minutes
    ## {
    ##     run through all key,values of %time_out and dump_data()
    ##     for all %time_out values that have "expired"...
    ##     probably need to add special kind of EOE message
    ## }
    cleanup() if ($current_time > $next_cleanup_due);

    sub cleanup
    {
        $next_cleanup_due += 6*60;  # 6 minutes from now
        foreach my $node_audit (keys %time_out)
        {
            if ($current_time > ($time_out{$node_audit} + $time_out))
            {
                push @{$data{$node_audit}}, "type=EXPIRED";
                dump_data($node_audit);
            }
        }
    }
    I would expect that the code runs really fast until, say once per minute or per 5 minutes, it slows down to look at everything it is working on and cleans up expired stuff. This "time out stuff" is relatively "expensive" CPU-wise. However, I recommend that you benchmark something similar to my suggestions and see what happens.
    Performance keys (see the Benchmark sketch below for point 1):
    1. Minimize use of the regex engine
    2. Use the simplest data structures possible - a multi-dimensional hash can get "expensive"
    3. Don't look at the whole hash of "work in progress" more often than you need to.
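    As a rough illustration of point 1, a sketch using the core Benchmark module to compare the regex capture against split/index on one sample line (untested against your data; numbers will vary by system):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # one representative audit line (abbreviated sample)
    my $line = 'node=aaaaaaaaaa type=SYSCALL msg=audit(1485583203.459:5485148): arch=c000003e syscall=59';

    cmpthese(-2, {
        regex => sub {
            my ($node, $sec, $serial) =
                $line =~ /^node=(\S+).*audit\((\d+\....):(\d+)\)/;
        },
        split => sub {
            my ($nodef, $typef, $auditf) = split ' ', $line, 4;
            my $is_eoe = index($typef, 'EOE') > 0;
        },
    });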

      Thank you for taking the time, Marshall. Let me try to add some clarity and a little feedback based on a couple of quick tests.

      I think that you want separate lines based upon the node and the audit information?
      As much as it can be defined, the issue is that I need to increase throughput: the script started out able to handle only a fraction of what it needed to. The result needs the multi-line event folded into a single line while reading STDIN, where the lines are intermixed between nodes.

      From what I understand, it could be that a node will spew out interleaved audit info representing 2 different events although your example data does not show that?
      Your understanding is correct. Hundreds of machines send syslog messages to a syslog server, but its output stream is serial and feeds STDIN of this script.

      I was looking at your code and my first thought was, "hey, run the regex engine once!" instead of multiple times.
      Both you and BrowserUk pointed this out in one way or another, and you opened my eyes to ways of thinking about the data so that the regex engine is needed minimally or not at all (index was new to me).

      ...This is like a "garbage collection" process that happens occasionally
      Yeah, it really is much like that. In the collection I was trying to avoid using time(), as it had a slight impact when processing 1.5M records; while light, why use it if the data's already available? I was using a line count as a trigger to perform the collection, though I could just as easily have used the extracted time (when it's extracted) and added N to it to determine when to collect again.
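      A sketch of that last idea, driving the cleanup off the audit timestamp already in the message rather than calling time() per line ($next_cleanup, $agecheck and cleanup() are illustrative names here; the trade-off is trusting the senders' clocks):

      # $auditf looks like: msg=audit(1485583203.459:5485148):
      # grab the 10 epoch-second digits without invoking the regex engine
      my $epoch = substr($auditf, index($auditf, '(') + 1, 10);

      if ($epoch > $next_cleanup) {
          cleanup($epoch);                      # age off old entries
          $next_cleanup = $epoch + $agecheck;   # schedule the next sweep
      }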

      I am very skeptical that multi-threads or multi-processes can help in this situation. At the end of the day, you are writing to a single disk file.
      The reason I thought about threading is that the need to "dedupe" the data (remove the multiples of the same field from the flattened record) is performed by a single regex, which seemed to be as good as it gets but also accounted for a 100-200% increase in processing time. If I can offload that to a thread (or two) while the main loop goes on reading data, then there would be a benefit, and I think I was able to achieve that. The other benefit is that writing to a single file is not a requirement; each thread could write independently to its own file. For most of the testing I performed, I left off the disk writes for now.

      The code you provided performed very well:

      $ wc -l audit-day.log
      1622199 audit-day.log
      $ ls -lh audit-day.log
      -rw-r-----. 1 xxxxxxx xxxx 264M Jan 29 11:02 audit-day.log
      
      Original code you provided (thank you!):
      $ time cat audit-day.log|./test-alt.pl >/dev/null
      
      real    0m3.07s
      user    0m3.02s
      sys     0m0.17s
      
      After removing the call to time(), the related assignment and delete:
      $ time cat audit-day.log|./test-alt.pl >/dev/null
      
      real    0m2.67s
      user    0m2.62s
      sys     0m0.14s
      
      And even further enhancement by appending the data to the end of the hash instead of using push/join:
      $ time cat audit-day.log|./test-alt.pl >/dev/null
      
      real    0m2.46s
      user    0m2.41s
      sys     0m0.14s
      
      Without the garbage collection this still is not realistic, but it helps in judging the impact of certain choices, which is what I have been doing a lot of over the last couple of days.
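      For reference, the append variant mentioned above amounts to replacing the array push and the later join with direct string concatenation (a sketch against Marshall's field names):

      # instead of:
      #   push @{$data{"$nodef $auditf"}}, "$typef $textf";
      #   ... and later: join(" ", @{$data{$node_audit}})
      # accumulate the flattened line as it arrives
      # (.= on a missing hash entry quietly starts from an empty string):
      $data{"$nodef $auditf"} .= " $typef $textf";

      # on EOE (or expiry), print the accumulated string as-is:
      print "$node_audit$data{$node_audit}\n";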

        I'm glad to hear that you are seeing some performance increases! This 2-3x is in the range of what I expected with index instead of regex, but that evidently is still "not enough":

        Without the garbage collection this still is not realistic, but it helps in judging the impact of certain choices, which is what I have been doing a lot of over the last couple of days.

        What performance benchmark do you think needs to be met in order for the system as a whole to work? Since you have excluded the time-out code for the moment and we've tweaked a number of issues, there isn't a whole lot of "meat" left on these bones!

        I don't see any super easy miraculous 10x (order of magnitude) solution. Even writing this thing in C is maybe just another 2-3x. From "reading between the lines", it sounds like you would like to do even more processing than the code that we've been benchmarking?

        Backing up a bit about the requirements... how does the output from your hundreds of servers come to be merged into a single pipe? Is there some way to distribute the load further "upstream" into multiple "fire hoses" instead of just a single one?
        Is it ok if Server123's data is on a separate machine from Server789's? It sounds to me like a server process model is more appropriate than threads, because it seems you will wind up needing multiple machines. That kind of approach can yield a 10x type of performance increase and be scalable.
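        A minimal single-machine sketch of that process model, hashing each node name onto one of N pre-forked children over pipes, so each child sees complete events and owns its own output file (flatten-worker.pl and the hash are illustrative; untested):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use IO::Handle;

        my $WORKERS = 4;
        my @pipe;   # parent-side write handles, one per child

        for my $i (0 .. $WORKERS - 1) {
            pipe(my $rd, my $wr) or die "pipe: $!";
            my $pid = fork() // die "fork: $!";
            if ($pid == 0) {                     # child: its own "fire hose"
                close $wr;
                open(STDIN, '<&', $rd) or die "dup: $!";
                exec './flatten-worker.pl', "out-$i.log";   # hypothetical worker
                die "exec: $!";
            }
            close $rd;
            $wr->autoflush(1);
            $pipe[$i] = $wr;
        }

        while (my $line = <STDIN>) {
            my ($node) = $line =~ /^node=(\S+)/ or next;
            my $sum = 0;                         # cheap, stable hash of the
            $sum += ord for split //, $node;     # node name: a given node
            print { $pipe[$sum % $WORKERS] } $line;   # always hits one child
        }

        close $_ for @pipe;
        1 while wait() != -1;                    # reap the children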

        Of course, also of interest is what is driving your requirements to begin with. What is the "user end product" result? I mean, so we have collected all the lines for a single node/time/event into a single line - so what? Why is that a requirement and why is that helpful? Maybe there is a way to produce whatever "end result" you desire without this very high performance program? I don't know, but this is an obvious question.

        Update: Another thought about your benchmark,
        $ time cat audit.log|./auditd-linux-orig.pl >/dev/null
        This running of cat, piping into auditd-linux-orig.pl, and redirecting shell output could potentially have some performance impact. Out of curiosity, is there any difference if auditd-linux-orig.pl opens a file handle to read from audit.log and a file handle to write to /dev/null, instead of using the shell redirection? Of course, your benchmark also includes a small amount of time for Perl to load and compile. I am currently using Windows, so I'm not sure any measurement I made would be applicable to your system.
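        Something like this, for that variation (paths taken from the benchmark command line):

        #!/usr/bin/perl
        use strict;
        use warnings;

        open(my $in,  '<', 'audit.log') or die "audit.log: $!";
        open(my $out, '>', '/dev/null') or die "/dev/null: $!";

        while (<$in>) {
            # ... existing parsing ...
            print {$out} $_;    # writes that previously went to STDOUT
        }

        close $in;
        close $out;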

Re: Multi-CPU when reading STDIN and small tasks
by BrowserUk (Patriarch) on Jan 28, 2017 at 16:06 UTC

    Some examples of your multi-line inputs would help.

      Each line is an individual syslog message, which means that related lines will rarely be consecutive, though all the related lines for an event should arrive within a very short time. In this example data there are two separate events whose lines need to be related, but the events themselves are independent of each other. The events can be identified by 1) node=xxxxxxxxxx and 2) node=aaaaaaaaaa
      node=xxxxxxxxxx type=SYSCALL msg=audit(1485583201.776:5485082): arch=c000003e syscall=82 per=400000 success=yes exit=0 a0=7fc164006990 a1=7fc164006b70 a2=7fc164006b70 a3=7fc230853278 items=4 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=4294967295 comm="somecommand" exe="/full/path/to/somecommand" key="delete"
      node=xxxxxxxxxx type=CWD msg=audit(1485583201.776:5485082):  cwd="/another/cwd"
      node=aaaaaaaaaa type=SYSCALL msg=audit(1485583203.459:5485148): arch=c000003e syscall=59 success=no exit=-2 a0=7f30b9d87149 a1=7f30b9d86860 a2=7f30b9d86bd8 a3=7f30b9d9c8c0 items=1 ppid=xxxxx pid=xxxxx auid=xxxxx uid=xxxxx gid=xxxxx euid=xxxxx suid=xxxxx fsuid=xxxxx egid=xxxxx sgid=xxxxx fsgid=xxxxx tty=(none) ses=16439 comm="command" exe="/bin/ksh93" key="cmdlineExecution"
      node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=0 name="arg-data-0" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT
      node=aaaaaaaaaa type=CWD msg=audit(1485583203.459:5485148):  cwd="/a/cwd"
      node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=1 name="arg-data-1" inode=268805 dev=fd:14 mode=040740 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=PARENT
      node=aaaaaaaaaa type=PATH msg=audit(1485583203.459:5485148): item=0 name="/etc/uname" nametype=UNKNOWN
      node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=2 name="arg-data-2" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=DELETE
      node=aaaaaaaaaa type=EOE msg=audit(1485583203.459:5485148):
      node=xxxxxxxxxx type=PATH msg=audit(1485583201.776:5485082): item=3 name="arg-data-3" inode=269256 dev=fd:14 mode=0100640 ouid=xxxxx ogid=xxxxx rdev=00:00 nametype=CREATE
      node=xxxxxxxxxx type=EOE msg=audit(1485583201.776:5485082):
      

        Is it correct to assume that:

        1. the audit parameters (timestamp+???) on all the parts of a single multiline entry from a given node will be the same?
        2. When a line is received from a node with different audit parameters to the last line received from that node, that previous multi-line entry is complete?

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Multi-CPU when reading STDIN and small tasks
by bspencer (Acolyte) on Jan 29, 2017 at 01:43 UTC

    I've continued to play with this and looked at the threads module as a way of threading the print_data sub, but it looks like the thread management takes more time than the actual sub, because the processing in the sub is small (even though it seems to be a major time consumer for overall processing).

    Because of this, I started to change direction a little: the main loop simply places the events in a queue and "worker" threads continually process those queues. I haven't gotten too far with this, but sharing data between the threads currently seems to be the issue. The other "complication" (good or bad) is that no file locking is used, because of the overhead; that requires knowing which files each thread writes to, so a given thread is forever responsible for writing to its file. This "pinning" to a thread is based on the hostname, which is already extracted. Again though, I'm stuck on data sharing when trying to use this process:

    1. Set up 1..n "worker" threads in an infinite loop watching their queue
    2. main loop gathers the data
    3. event found, send it to the "queue" sub
    4. The "queue" sub determined which worker queue to place the data in
    5. That queue worker supposedly will see the data and process it
    The "queue" is a single hash shared between the two threads. Even as I write that I wonder why I did that

    A few stats from the current (single-threaded) code, which will support data from about 200-300 hosts on a single syslog-server CPU at the rate we are seeing. Unfortunately, that is at best only half of what it needs to handle.

    $ time cat audit.log|./auditd-linux-orig.pl >/dev/null
    Running for: 11 seconds
    Total Lines: 498798 (45345.3 per second)
    Total Events: 81192 (7381.1 per second)
    Dedupe Savings:  31.0% (76.6MB reduced to 52.8MB)
    
    real    0m10.81s
    user    0m10.33s
    sys     0m0.31s