in reply to Re: Multi-CPU when reading STDIN and small tasks
in thread Multi-CPU when reading STDIN and small tasks

Thank you for taking the time, Marshall. Let me try to add some clarity and a little feedback based on a couple of quick tests.

I think that you want separate lines based upon the node and the audit information?
As much as it can be defined, the issue is that I need to increase the script's throughput; as it started, it could handle only a fraction of what was needed. The result needs each multiline event folded into a single line while reading STDIN, where lines from different nodes are intermixed.

From what I understand, it could be that a node will spew out interleaved audit info representing 2 different events although your example data does not show that?
Your understanding is correct. Hundreds of machines send syslog messages to a syslog server, but its output stream is serial, and that stream feeds the STDIN of this script.

I was looking at your code and my first thought was, "hey, run the regex engine once!" instead of multiple times.
Both you and BrowserUk pointed this out in one way or another, and you opened my eyes to ways of approaching the data that need the regex engine minimally or not at all (index was new to me).
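To illustrate for anyone following along, this is the kind of swap involved; the example is made up for illustration, not the actual script's code:

my $line = 'type=SYSCALL msg=audit(1485707921.123:4902): arch=c000003e';

# Regex version: invokes the regex engine for what is a fixed-string search
my ($id_re) = $line =~ /msg=audit\(([^)]+)\)/;

# index/substr version: plain substring scan, no regex engine involved
my $start = index($line, 'msg=audit(');
if ($start >= 0) {
    $start += length('msg=audit(');
    my $end = index($line, ')', $start);
    my $id  = substr($line, $start, $end - $start);   # "1485707921.123:4902"
}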

...This is like a "garbage collection" process that happens occasionally
Yeah, it really is much like that. In the collection I was trying to avoid using time(), as it had a slight impact when processing 1.5M records; while the cost is light, why pay it if the data's already available? I was using a line count as a trigger to perform the collection, though I could just as easily have used the extracted time (when it's extracted) and added N to it to determine when to collect again.
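For the curious, the line-count trigger looks roughly like this; a sketch with illustrative names and thresholds (extract_id, emit, and the two limits are placeholders), not the actual script:

use strict;
use warnings;

my %event;      # audit id => accumulated single-line record
my %last_seen;  # audit id => input line number when last updated
my $lineno   = 0;
my $interval = 50_000;    # sweep every N input lines
my $max_idle = 100_000;   # flush events untouched for N lines

sub extract_id { $_[0] =~ /msg=audit\(([^)]+)\)/ ? $1 : 'unknown' }
sub emit       { print "$_[0]\n" }

while (my $line = <STDIN>) {
    chomp $line;
    my $id = extract_id($line);
    $event{$id} .= ($event{$id} ? ' ' : '') . $line;
    $last_seen{$id} = ++$lineno;

    next if $lineno % $interval;    # the "collection" only runs occasionally
    for my $id (keys %last_seen) {
        next if $lineno - $last_seen{$id} < $max_idle;
        emit($event{$id});          # no new lines for a while: presume complete
        delete $event{$id};
        delete $last_seen{$id};
    }
}
emit($event{$_}) for keys %event;   # flush whatever remains at EOF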

I am very skeptical that multi-threads or multi-processes can help in this situation. At the end of the day, you are writing to a single disk file.
The reason I thought about threading is that the need to "dedupe" the data (remove the multiples of the same field from the flattened record) was handled by a single regex that seemed to be as good as it gets, yet it still accounted for a 100-200% increase in processing time. If I can offload that to a thread (or two) while the main loop carries on reading data, then there is a benefit, and I think I was able to achieve that. The other benefit is that writing to a single file is not a requirement, so each thread could write independently to its own file. For most of the testing I performed, I left off the disk writes for now.
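For reference, the shape of the offload experiment was something like this: a sketch using threads and Thread::Queue, not the actual code (the file name is made up, and the folding step is elided). The main loop only reads and enqueues completed records; the worker runs the dedupe substitution and writes to its own file:

use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new;

# Worker: dedupe repeated fields, write to its own file (no contention)
my $worker = threads->create(sub {
    open my $out, '>', 'audit-flat.1.log' or die "open: $!";
    while (defined(my $rec = $q->dequeue)) {
        my %seen;
        $rec =~ s/((\S+)\s?)/$seen{$2}++ ? '' : $1/eg;
        print {$out} $rec, "\n";
    }
    close $out;
});

# Main loop: read STDIN, fold multiline events (elided), hand off records
while (my $line = <STDIN>) {
    chomp $line;
    # ... fold $line into its event; when the event completes:
    $q->enqueue($line);    # stand-in for the completed, flattened record
}

$q->end;          # tell the worker no more items are coming
$worker->join;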

The code you provided performed very well:

$ wc -l audit-day.log
1622199 audit-day.log
$ ls -lh audit-day.log
-rw-r-----. 1 xxxxxxx xxxx 264M Jan 29 11:02 audit-day.log
Original code you provided (thank you!):
$ time cat audit-day.log|./test-alt.pl >/dev/null

real    0m3.07s
user    0m3.02s
sys     0m0.17s
After removing the call to time(), the related assignment and delete:
$ time cat audit-day.log|./test-alt.pl >/dev/null

real    0m2.67s
user    0m2.62s
sys     0m0.14s
And a further enhancement, from appending the data directly onto the hash value instead of using push/join (a sketch follows the timings):
$ time cat audit-day.log|./test-alt.pl >/dev/null

real    0m2.46s
user    0m2.41s
sys     0m0.14s
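For reference, that push/join-to-append change amounts to this (names are illustrative):

my (%parts, %event);
my ($id, $fragment) = ('1485707921.123:4902', 'arch=c000003e syscall=2');

# Before: collect the fragments in an array, join when the event completes
push @{ $parts{$id} }, $fragment;
my $joined = join ' ', @{ $parts{$id} };

# After: append directly onto the hash value; one string op per line,
# no array bookkeeping and no final join
$event{$id} .= ($event{$id} ? ' ' : '') . $fragment;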
Without the garbage collection this still is not realistic, but it helps in judging the impact of certain choices, which is what I have been doing a lot of over the last couple of days.


Re^3: Multi-CPU when reading STDIN and small tasks
by Marshall (Canon) on Feb 02, 2017 at 03:02 UTC
    I'm glad to hear that you are seeing some performance increases! This 2-3x is in the range of what I expected with index instead of regex, but that evidently is still "not enough":

    Without the garbage collection this still is not realistic, but it helps in judging the impact of certain choices, which is what I have been doing a lot of over the last couple of days.

    What performance benchmark do you think needs to be met in order for the system as a whole to work? Since you have excluded the time-out code for the moment and we've tweaked a number of issues, there isn't a whole lot of "meat" left on these bones!

    I don't see any super easy miraculous 10x (order of magnitude) solution. Even writing this thing in C is maybe just another 2-3x. From "reading between the lines", it sounds like you would like to do even more processing than the code that we've been benchmarking?

    Backing up a bit about the requirements... how does the output from your hundreds of servers come to be merged into a single pipe? Is there some way to distribute the load further "upstream" into multiple "fire hoses" instead of just a single one?
    Is it ok if Server123's data is on a separate machine from Server789's? It sounds to me like a server process model is more appropriate than threads because this is sounding like you will wind up needing multiple machines. That kind of approach can yield a 10x type of performance increase and be scalable.

    Of course, also of interest is what is driving your requirements to begin with. What is the "user end product" result? I mean, so we have collected all the lines for a single node/time/event into a single line; so what? Why is that a requirement and why is that helpful? Maybe there is a way to do the processing of whatever "end result" you desire without this very high performance program? I don't know, but this is an obvious question.

    Update: Another thought about your benchmark,
    $ time cat audit.log|./auditd-linux-orig.pl >/dev/null
    This running of cat and piping into auditd-linux-orig.pl and re-directing shell output could potentially have some performance impact. Out of curiosity, is there any difference if auditd-linux-orig.pl opens a file handle for read from audit.log and a file handle for write to /dev/null? Instead of using the shell re-direction? Of course there is also a small difference included in your benchmark for Perl to load and compile. I am currently using Windows and I'm not sure if any measurement that I made would be applicable to your system.
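    Something along these lines is what I have in mind (an untested sketch):

    open my $in,  '<', 'audit-day.log' or die "read: $!";
    open my $out, '>', '/dev/null'     or die "write: $!";
    while (my $line = <$in>) {
        # ... the same processing as in the benchmark ...
        print {$out} $line;
    }
    close $in;
    close $out;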

      I've been distracted over the last few days, actually working on some of the items you mentioned (before seeing your comments).

      What performance benchmark do you think needs to be met in order for the system as a whole to work?
      Each machine is sending on average 170 events/second in the little POC we are working through. Ideally I would have liked to see a single instance handle around 100K lines/second, to reduce the number of other workarounds required. Using the index method (with nothing else, such as writing/processing, going on) peaked at about 60K, which would be workable, but then again, it's not actually processing as it needs to.

      it sounds like you would like to do even more processing than the code that we've been benchmarking?
      The obvious thing the code wasn't doing in these tests was writing to files. It was simply displaying to STDOUT instead, to allow for format confirmation as needed; this is why STDOUT was redirected to /dev/null in the tests. The other extra processing, the removal of duplicate fields, was accounted for in the tests.

      how does the output from your hundreds of servers come to be merged into a single pipe?
      Syslog, basically. Each server -> central syslog servers -> STDOUT piped to the script -> written to disk -> ingested into something which can't deal with the auditd format.

      Is there some way to distribute the load further "upstream" into multiple "fire hoses" instead of just a single one?
      There is, and it is the path I've started going down based on last weekend's exploration. Until tens of conditions are introduced, this seems workable as a method of spreading the load:

      Server ->                   Script Instance A (based on condition A)
      Server ->  Syslog server -> Script Instance B (based on condition B)
      Server ->                   Script Instance C (based on condition C)
      
      A single output file at the end is NOT one of the requirements, so A, B and C will each create their own file in order to avoid locking/contention between the different "threads". Based on an attribute in the server name, I think this may be a workable solution, and the number of servers per "thread" will work, at least today. I'm still working on this setup to confirm it; a rough sketch of the routing follows.
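      The routing piece itself is simple enough; roughly this (a sketch only: flatten.pl and the split of hostnames into thirds are made up for illustration):

      my %pipe;
      for my $inst (qw(A B C)) {
          open $pipe{$inst}, '|-', "./flatten.pl > audit-flat.$inst.log"
              or die "instance $inst: $!";
      }
      while (my $line = <STDIN>) {
          # standard syslog: "Mon DD HH:MM:SS hostname ..." -> field 3 is the host
          my $host = lc((split ' ', $line, 5)[3] // '');
          my $inst = $host lt 'j' ? 'A'    # condition A: hostnames a..i
                   : $host lt 's' ? 'B'    # condition B: hostnames j..r
                   :                'C';   # condition C: the rest
          print { $pipe{$inst} } $line;
      }
      close $_ for values %pipe;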

      Of course of interest is what is driving your requirements to begin with?
      The end requirement is to have the auditd data from all of the servers, but in a "flattened" format in files, so that they can be read into something which will analyze them. Because of the amount of data involved, and because it reduces the space required by about 30%, we added the deduplication of fields within each of the single-line events.

      Out of curiosity, is there any difference if auditd-linux-orig.pl opens a file handle for read from audit.log and a file handle for write to /dev/null?
      This is not an item I tried. I could just as easily have commented out the print statement, which is the only reason for the redirection. In actual use the script reads from STDIN (I like using cat to simulate that) and writes to files; STDOUT is never used.

        Ok, now I understand the performance requirements better.

        Doubling the performance from 60K to 120K lines/sec with your current single process would be possible, albeit with some C code. But that still wouldn't do all that you want. I predict that I could code $singleline=~s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg; much more efficiently in ASM than in C, because there are certain instructions that are difficult for the C compiler to even use. If this were an embedded hardware board application, it would be worth the effort. But here, I think not! I believe you are better served with a pure Perl application.
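        To spell out what that one-liner does, a self-contained demo (the sample line is made up):

        my %count;
        my $singleline = 'uid=0 gid=0 comm="cron" uid=0 exe="/usr/sbin/cron" gid=0';
        # keep the first occurrence of each whitespace-separated field, drop repeats
        # (%count must be reset for each record in real use)
        $singleline =~ s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg;
        print "$singleline\n";   # uid=0 gid=0 comm="cron" exe="/usr/sbin/cron"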

        I think you are on the right track in distributing this incoming "firehose of data" between multiple processing entities. Right now it appears that you are thinking about one program with multiple threads. I would be thinking of multiple instances of a single-threaded process with a "router" process. Let the OS assign these processes to different machine cores. I don't see any requirement for these processes to communicate with each other or share information. A consideration could be how easy it is to just add an additional machine when the load increases.

        Leave your final "print" in the benchmark. That does all the work, and it does go to STDOUT; its result just gets re-directed to the "bit bucket".

        I am still curious as to what this analysis program does with this massive amount of data. It seems that some kind of "front-end" to this thing might be possible? Extract perhaps a time window, or perhaps all data from Server X, from the main log file, and analyze that in non-real time. It seems to me that the processing power for the super-fast concatenation of lines and the 30% compression of the data due to "dupes" must be minuscule compared to the overall effort of the analysis program? Aside from reducing the storage required, it is not clear how much this will help the "final end result".