in reply to Re^4: Multi-CPU when reading STDIN and small tasks
in thread Multi-CPU when reading STDIN and small tasks

Ok, now I understand the performance requirements better.

Doubling the performance from 60K to 120K lines/sec with your current single process would be possible, albeit with some C code. But that still wouldn't do all that you want. I predict that I could code $singleline=~s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg; much more efficiently in ASM than in C, because there are certain instructions that are difficult for the C compiler even to use. If this were an embedded hardware board application, it would be worth the effort. But here, I think not! I believe you are better served with a pure Perl application.
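As a sanity check of what that one line actually does, here is a minimal, self-contained sketch of the dedupe substitution on a sample line (the sample text is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove duplicate whitespace-separated tokens from a line,
# keeping the first occurrence of each.
my $singleline = "alpha beta alpha gamma beta";
my %count;
$singleline =~ s/((\S+)\s?)/$count{$2}++ ? '' : $1/eg;

# Each kept token keeps its one trailing space; a removed token's
# trailing whitespace goes with it.
print "$singleline\n";    # prints "alpha beta gamma " plus newline
```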

I think you are on the right track in distributing this incoming "firehose of data" across multiple processing entities. Right now it appears you are thinking of one program with multiple threads. I would be thinking of multiple instances of a single-threaded process behind a "router" process, letting the OS assign those processes to different machine cores. I don't see any requirement for these processes to communicate with each other or share information. One consideration is how easy it would be to just add another machine when the load increases.
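A minimal sketch of that router-plus-workers shape in pure Perl — the worker body is just the dedupe substitution already shown, and the worker count and round-robin dealing are assumptions, not a tuned design:

```perl
#!/usr/bin/perl
# Sketch: a "router" parent fans STDIN lines out to N single-threaded
# worker processes over pipes; the OS schedules the workers across cores.
use strict;
use warnings;
use IO::Handle;

my $nworkers = 4;                      # assumption: tune to core count
my @to_worker;

for (1 .. $nworkers) {
    pipe(my $rd, my $wr) or die "pipe: $!";
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {                   # child: single-threaded worker
        close $wr;
        while (my $line = <$rd>) {
            my %seen;
            $line =~ s/((\S+)\s?)/$seen{$2}++ ? '' : $1/eg;
            print $line;               # all workers share STDOUT
        }
        exit 0;
    }
    close $rd;                         # parent keeps only the write end
    $wr->autoflush(1);
    push @to_worker, $wr;
}

# Router loop: deal incoming lines out round-robin.  Note this does not
# preserve the relative output order between workers.
my $i = 0;
while (my $line = <STDIN>) {
    my $fh = $to_worker[ $i++ % $nworkers ];
    print {$fh} $line;
}

close $_ for @to_worker;               # EOF tells workers to finish
wait() for 1 .. $nworkers;             # reap the children
```

Since the workers share nothing, scaling out later could be as simple as pointing the router's pipes at sockets on another machine.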

Leave your final "print" in the benchmark. That print does real work, and it does go to STDOUT; its output just gets redirected to the "bit bucket".

I am still curious what this analysis program does with this massive amount of data. It seems that some kind of "front-end" to it might be possible: extract, say, a time window, or all data from Server X, from the main log file, and analyze that in non-real time. It seems to me that the processing cost of the super-fast concatenation of lines, and the 30% compression of the data by removing "dupes", must be minuscule compared to the overall effort of the analysis program. Aside from reducing the storage required, it is not clear how much this will help the "final end result".

Replies are listed 'Best First'.
Re^6: Multi-CPU when reading STDIN and small tasks
by Anonymous Monk on Feb 12, 2017 at 23:21 UTC

    I am (now) going down the road of poor-man's-threading, letting rsyslog start multiple copies of the same script based on matching criteria. It's working well in the small-scale POC we have going.
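    For readers following along, the rsyslog side of that fan-out might look roughly like this — the program names, script path, and the choice of the omprog module are all assumptions, since the actual criteria and mechanism in the POC were not shown:

    ```
    module(load="omprog")

    # One copy of the same script per message class; rsyslog keeps each
    # omprog child running and feeds matching messages to its stdin.
    if $programname == "appA" then {
        action(type="omprog" binary="/usr/local/bin/dedupe.pl")
    }
    if $programname == "appB" then {
        action(type="omprog" binary="/usr/local/bin/dedupe.pl")
    }
    ```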

    The deduplication is purely for the space savings and has little or no impact on the searches being performed. Splunk will be ingesting the data and performing the searches against it. That side of the effort is currently handled by another team, and I have little or no insight into it beyond making sure I can get them the data in a format that works for them.