I've been distracted over the last few days actually working on some of the items you mentioned (before seeing your comments).
What performance benchmark do you think needs to be met in order for the system as a whole to work? Each machine is sending, on average, 170 events/second in the little POC we are working through. Ideally I would have liked a single instance to handle around 100K lines/second to reduce the number of other workarounds required. Using the index method (with nothing else, such as writing or further processing) it peaked at about 60K, which would be workable, but then again it isn't actually processing as it needs to.
it sounds like you would like to do even more processing than the code that we've been benchmarking?
The obvious thing the code wasn't doing in these tests was writing to files. It was simply printing to STDOUT instead, to allow for format confirmation as needed; this is why STDOUT was redirected to /dev/null in the tests. The other extra processing, the removal of duplicate fields, was accounted for in the tests.
how does the output from your hundreds of servers come to be merged into a single pipe?
Syslog basically. Each server -> central syslog servers -> STDOUT piped to script -> written to disk -> ingested into something which can't deal with the auditd format
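To make the script's place in that pipe concrete, here is a minimal sketch, assuming the shape described above: lines arrive on STDIN from the central syslog servers, each one is processed, and output goes to a file on disk rather than STDOUT. The `pump` subroutine name and the placeholder processing step are my inventions, not the real script.

```perl
#!/usr/bin/perl
# Minimal sketch of this script's place in the pipe: syslog hands us
# lines on STDIN, we process each one, and the result goes to a file
# on disk (never STDOUT in real use). Processing is a placeholder.
use strict;
use warnings;

sub pump {
    my ($in, $out) = @_;
    while ( my $line = <$in> ) {
        # the real script flattens the auditd record here
        print {$out} $line;
    }
}

# real use would look roughly like:
#   open my $fh, '>>', 'audit-flat.log' or die "open: $!";
#   pump( \*STDIN, $fh );
```

For testing, `cat audit.log | ./script.pl` stands in for the live syslog feed, which matches the cat-to-STDIN testing mentioned later in the thread.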
Is there some way to distribute the load further "upstream" into multiple "fire hoses" instead of just a single one?
There is, and it is the path I've started going down based on last weekend's exploration. Until tens of conditions are introduced, this seems workable as a method of spreading the load:
Server -> Syslog server -> Script Instance A (based on condition A)
                        -> Script Instance B (based on condition B)
                        -> Script Instance C (based on condition C)

One of the requirements is NOT that there be a single file at the end, so A, B, and C will each create their own file in order to avoid locking/contention between the different "threads". Based on an attribute in the server name, I think this may be a workable solution, and the number of servers per "thread" will work, at least today. I'm still working on this setup to confirm it.
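The fan-out idea above can be sketched as follows. This is a guess at the mechanics, not the actual setup: the bucket rule (first letter of the hostname), the bucket names, the output filenames, and the assumption of a traditional syslog prefix ("Mon DD HH:MM:SS host ...") are all invented for illustration; the real conditions on the server name would differ.

```perl
#!/usr/bin/perl
# Sketch: pick a "thread" (output file) from an attribute of the
# server name. The first-letter rule and filenames are hypothetical.
use strict;
use warnings;

sub bucket_for {
    my ($host) = @_;
    return 'A' if $host =~ /^[a-h]/i;    # hypothetical condition A
    return 'B' if $host =~ /^[i-p]/i;    # hypothetical condition B
    return 'C';                          # everything else
}

my %out;
while ( my $line = <STDIN> ) {
    # crude parse: 4th whitespace field of a traditional syslog line
    my ($host) = $line =~ /^\S+\s+\S+\s+\S+\s+(\S+)/;
    my $b = bucket_for( $host // '' );
    unless ( $out{$b} ) {
        # each bucket appends to its own file, so the "threads"
        # never contend for the same file
        open $out{$b}, '>>', "audit-flat-$b.log" or die "open $b: $!";
    }
    print { $out{$b} } $line;
}
```

Because each instance owns its output file outright, no locking is needed between them, which matches the no-single-file requirement.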
Of course of interest is what is driving your requirements to begin with?
The end requirement is to have the auditd data from all of the servers, but in a "flattened" format, in files that can be read into something which will analyze them. Because of the amount of data involved, and because it reduces the space required by about 30%, we added deduplication of fields within each single-line event.
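One plausible reading of that per-line deduplication is: within one flattened event, keep only the first occurrence of each `key=value` key. This is a sketch under that assumption; the real script's rule may differ (it might, for instance, compare whole `key=value` pairs rather than keys), and the space-separated field format is assumed.

```perl
#!/usr/bin/perl
# Sketch of per-line field deduplication: keep the first occurrence
# of each key in a space-separated key=value record. The exact rule
# used by the real script is an assumption here.
use strict;
use warnings;

sub dedup_fields {
    my ($line) = @_;
    my ( %seen, @kept );
    for my $field ( split ' ', $line ) {
        my ($key) = split /=/, $field, 2;
        push @kept, $field unless $seen{$key}++;
    }
    return join ' ', @kept;
}
```

For example, `dedup_fields('uid=0 pid=42 uid=0')` returns `'uid=0 pid=42'`.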
Out of curiosity, is there any difference if auditd-linux-orig.pl opens a file handle for read from audit.log and a file handle for write to /dev/null?
This is not something I tried. I could just as easily have commented out the print statement, which is the only reason for the redirection. In actual use the script reads from STDIN (I like using cat to simulate that) and writes to files; STDOUT is never used.
In reply to Re^4: Multi-CPU when reading STDIN and small tasks
by bspencer
in thread Multi-CPU when reading STDIN and small tasks
by bspencer