in reply to Multi-CPU when reading STDIN and small tasks

I've continued to play with this and looked at the threads module as a way to thread the print_data sub, but it looks like the thread management takes more time than the sub itself, because the work done inside the sub is quite small (even though it is a major time consumer in the overall processing).
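
A quick way to see this effect (untested sketch; tiny_work is a stand-in for a cheap per-event sub, not the real print_data):

  use strict;
  use warnings;
  use threads;
  use Benchmark qw(timethese);

  # Stand-in for a small per-event sub -- NOT the real print_data,
  # just something comparably cheap so the thread overhead is visible.
  sub tiny_work {
      my $n = 0;
      $n += $_ for 1 .. 200;
      return $n;
  }

  timethese(2_000, {
      inline          => sub { tiny_work() },
      # create+join a thread per event: management cost dwarfs the work
      thread_per_call => sub { threads->create(\&tiny_work)->join },
  });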

Because of this, I started to change direction a little: the main loop simply places events in a queue and lets "worker" threads continually process those queues. I haven't gotten too far with this because sharing data between the threads is currently the sticking point. The other "complication" (good or bad) is that no file locking is used, to avoid the overhead; that requires knowing which files each thread writes to, so a given thread is forever responsible for writing to its files. This "pinning" to a thread is based on the hostname, which is already extracted. Again though, I'm stuck on data sharing when trying to use this process:

  1. Set up 1..n "worker" threads, each in an infinite loop watching its own queue
  2. The main loop gathers the data
  3. When an event is found, it is sent to the "queue" sub
  4. The "queue" sub determines which worker queue to place the data in
  5. That worker is supposed to see the data and process it
The "queue" is a single hash shared between the two threads. Even as I write that I wonder why I did that

A few stats from the current (single-threaded) code, which will support data from about 200-300 hosts on a single syslog-server CPU at the rate we are seeing. Unfortunately, that is at best only half of what it needs to handle.

$ time cat audit.log | ./auditd-linux-orig.pl >/dev/null
Running for: 11 seconds
Total Lines: 498798 (45345.3 per second)
Total Events: 81192 (7381.1 per second)
Dedupe Savings:  31.0% (76.6MB reduced to 52.8MB)

real    0m10.81s
user    0m10.33s
sys     0m0.31s