jvuman has asked for the wisdom of the Perl Monks concerning the following question:
I have an application that listens for SNMP traps using Net::SNMPTrapd. It examines the trap, discards most, and does some work (taking about 5ms) with a few of them. The rate of traps coming in is considerable (>100/s). The current version can handle that rate, but I'd like to make it more scalable. To that end, I've rewritten it with a master process controlling multiple worker threads.
The master process creates a lockfile that the worker threads then use to control access to port 162. The worker threads sit in a loop blocking on flock(). When they succeed in getting the lock, they get a trap from port 162 using Net::SNMPTrapd and give up the lock. Here are the basics of what each worker thread does:
```perl
use Fcntl qw(:flock);
use Net::SNMPTrapd;

# $lockfile is the handle to the lockfile created by the master process
my $snmptrapd = Net::SNMPTrapd->new(ReusePort => 1);

while (1) {
    flock($lockfile, LOCK_EX);
    my $trap = $snmptrapd->get_trap();
    flock($lockfile, LOCK_UN);

    # filter traps based on IP, etc. here

    $trap->process_trap();   # do some work here, usually taking 3-5ms
}
```
All this works. However, I find that regardless of the number of worker threads I create, one of them does most of the work, about 70%. That is OK for now, since one thread can handle the current volume of traps, but it seems to defeat my objective of scalability.
I have tried several methods to encourage the hardest-working thread to sit back and allow the others a turn. The one that I thought made sense was having each thread keep track of how many traps it's received and do a sub-second sleep when its trap count crosses a threshold. That just resulted in dropped traps.
Is there some technique I'm missing here? Running on Perl 5.16 on CentOS 7. Thank you!
Re: worker threads - one does all the work
by BrowserUk (Patriarch) on Jun 08, 2017 at 22:01 UTC
> regardless of the number of worker threads I create, one of them does most - about 70% - of the work.

When the thread that processed the last trap finishes with it, it is still running (it still has its timeslice), so it immediately loops back and attempts to obtain the lock. Most of the time it will succeed, because none of the other threads are running at that moment. The other threads will only get a look in if this thread is swapped out, and that will only happen if it takes longer than its timeslice to process the previous trap.

I'm not overly familiar with *nix system priorities and scheduling, but the idea of using a file-system lock, even if it is cached, as a distribution mechanism for network IO traffic seems a little like putting a lollipop lady on a motorway. "Scalable" isn't the word that comes to mind here.
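This lock-greediness can be observed without any SNMP at all. Here is a rough, self-contained sketch that has four forked workers contend for an exclusive `flock()` in a tight loop; the `/tmp` paths and the 200-iteration count are arbitrary, and plain `fork` stands in for threads:

```perl
#!/usr/bin/env perl
# Four workers grab an exclusive flock() in a tight loop and log
# their pid each time they win. The worker that just released the
# lock is still on-CPU, so it tends to re-acquire it before its
# blocked siblings are scheduled.
use strict;
use warnings;
use Fcntl qw(:flock);
use IO::Handle;

my $lock_path = '/tmp/convoy.lock';
my $log_path  = '/tmp/convoy.log';
my $TOTAL     = 200;
my $WORKERS   = 4;

open my $init, '>', $log_path or die "log: $!";    # truncate log
close $init;
open my $mk, '>', $lock_path or die "lock: $!";    # ensure lockfile exists
close $mk;

my @pids;
for (1 .. $WORKERS) {
    my $pid = fork() // die "fork: $!";
    if ($pid == 0) {
        open my $lfh, '<', $lock_path or die "lock: $!";
        open my $log, '>>', $log_path or die "log: $!";
        while (1) {
            flock $lfh, LOCK_EX or die "flock: $!";
            open my $rd, '<', $log_path or die "read: $!";
            my $lines = 0;
            $lines++ while <$rd>;
            close $rd;
            if ($lines >= $TOTAL) { flock $lfh, LOCK_UN; exit 0 }
            print {$log} "$$\n";
            $log->flush;                           # write before unlocking
            flock $lfh, LOCK_UN;
        }
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;

# Tally how many acquisitions each worker pid got.
open my $rd, '<', $log_path or die "read: $!";
my %wins;
while (my $line = <$rd>) { chomp $line; $wins{$line}++ }
my $total = 0;
$total += $_ for values %wins;
print "$total acquisitions across ", scalar(keys %wins), " winners\n";
```

In informal runs, one pid usually collects the lion's share of acquisitions, mirroring the roughly 70% skew described in the question. Only the total count is deterministic; the per-pid distribution depends on the scheduler.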
Re: worker threads - one does all the work
by marioroy (Prior) on Jun 09, 2017 at 03:56 UTC
Update: Previously, I noted the time after enqueuing 10k traps into the queue. Unfortunately, I didn't factor in the traps still pending in the queue, and meant to report the duration after the queue is depleted. I've gone through and corrected all my posts in this thread.

Hello jvuman, and welcome to the amazing monastery. Thank you for introducing Net::SNMPTrapd, which I've not used before.

It is possible to run a powerful trap server using a single listener and many consumers. One might do so using threads and Thread::Queue, or similarly with MCE::Flow and MCE::Queue. The latter is provided below. Here, I have each consumer sleep for 4 milliseconds to simulate work. Awaiting on the queue is a safety measure to prevent the queue from consuming gigabytes of memory in the event of receiving millions of traps. Please adjust the threshold to your satisfaction. In my testing, the server process never entered the pending if-statement.

MCE and MCE::Shared (not used here) involve IPC behind the scenes. Fetching is faster from a BSD OS (e.g. FreeBSD, darwin) than from Linux. See this post for more info.
That seems fast, considering that 2 producers share CPU time with the listener process and consumers.
In reality, the server process can handle more than 4,000 traps per second simply by running the server and consumers only.
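A minimal, self-contained sketch of this single-listener, many-consumers shape using core threads and Thread::Queue (one of the two approaches mentioned above). The fake trap strings, the 500-item back-off threshold, and the loop counts are illustrative stand-ins; a real server would call `$snmptrapd->get_trap()` in the listener loop:

```perl
use strict;
use warnings;
use threads;
use threads::shared;
use Thread::Queue;

my $queue = Thread::Queue->new;
my $done :shared = 0;
my $CONSUMERS = 4;

# Consumers: block on the queue, simulate ~4 ms of work per trap.
my @consumers = map {
    threads->create(sub {
        while (defined(my $trap = $queue->dequeue)) {
            # filtering and $trap->process_trap() would go here
            select(undef, undef, undef, 0.004);
            { lock $done; $done++; }
        }
    });
} 1 .. $CONSUMERS;

# Listener stand-in: a real server would loop on get_trap() here.
for my $n (1 .. 100) {
    # safety valve: back off if consumers fall far behind
    select(undef, undef, undef, 0.010) while $queue->pending > 500;
    $queue->enqueue("trap-$n");
}

$queue->enqueue(undef) for @consumers;   # one end-marker per consumer
$_->join for @consumers;

print "processed $done traps\n";         # prints: processed 100 traps
```

Because only the listener touches the socket, there is no lock to fight over; the queue itself distributes the work to whichever consumer is free.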
Regards, Mario.
by marioroy (Prior) on Jun 09, 2017 at 04:48 UTC
Update: I had a chance to run this on a Linux machine. IPC-wise, fetching (dequeue) is slower on a Linux OS than on a BSD variant such as FreeBSD or darwin. Running with threads also takes longer. See this post for more info. Below, I've updated the main script to show the enqueue time, pending count, and finally the duration.

A production environment might have a load balancer and 4 pizza-box (1-inch) servers. Together, the 4 servers can handle 1 million traps per minute if that's the scale needed. I reached 4.5k traps per second on my Linux machine.

Update: Loading threads at the top of the script will have MCE spawn threads instead. Doing so may cause consumers to run slower for some reason on Linux. Maybe it's from running an older Perl 5.16.3 release; I'm not sure.

For maximum performance, check whether Perl has Sereal installed. MCE defaults to Sereal::Encoder 3.015+ and Sereal::Decoder 3.015+ for serialization when available.
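Checking for Sereal can be done at run time. A tiny standalone probe (nothing here is MCE-specific; the message text is made up):

```perl
use strict;
use warnings;

# Probe for the optional Sereal serializer modules without dying
# if they are absent; MCE falls back to Storable in that case.
my $has_sereal = eval {
    require Sereal::Encoder;
    require Sereal::Decoder;
    1;
} ? 'yes' : 'no';

print "Sereal available: $has_sereal\n";
```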
Update: I've updated the code to synchronize STDOUT and STDERR output to the manager process. Furthermore, I specified the user_output and user_error MCE options so that one can send output to a logging routine if need be. Omitting them will have the manager process write directly to STDOUT and STDERR, respectively.

Hello again jvuman,

Generating traps is handled by another script. Doing this allowed me to move the trap generator (producer) to another machine. My laptop running the server script can process 6k per second.
Here is the producer script for generating traps in parallel. This is useful for load testing.
To benchmark, run snmp_server.pl on machine A, then run snmp_producer.pl on machine B. Remember to change the IP address inside the producer to host A's address if running on another host.
Regards, Mario.
by marioroy (Prior) on Jun 09, 2017 at 12:04 UTC
The following does something similar to what the OP described. I've experienced lost traps on Linux, causing the server script never to leave the loop. On the Mac, when successful, it takes 3.822 seconds to process 10,000 traps.
Regards, Mario.
by zentara (Cardinal) on Jun 09, 2017 at 13:52 UTC
by marioroy (Prior) on Jun 10, 2017 at 06:07 UTC
Hi zentara. I started something on GitHub but am not liking the format, so I've stopped temporarily. At some point I will renew the documentation.
Re: worker threads - one does all the work
by talexb (Chancellor) on Jun 09, 2017 at 15:07 UTC
This is a really interesting post, and reminds me of doing assembler programming with Interrupt Service Routines in the '70s and '80s.

To handle the highest possible rate of interrupts, the ISR code needed to be as brief as possible: it would wake up, grab the data that had just arrived, and stuff it into a circular buffer, for example. The data might be a single character, several characters, or even an entire message. Somewhere else, an idle loop would watch the same circular buffer for activity, and as soon as something arrived, it would deal with it at normal priority. These two activities live in different worlds, one doing as little as possible as quickly as possible, and the other doing the needful as data arrived. Ideally, this avoids the situation where interrupts happen faster than they can be processed, resulting in dropped events and therefore lost data.

Applying this to your situation, I might have the parent handle the clunkier processing and have the child handle the SNMP traps .. but if there are plenty of solutions already on CPAN, that would probably be a better way forward. Thanks again for the intriguing post.
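The ISR pattern described above can be sketched as a toy fixed-size ring buffer. The fast path only appends and never blocks; when the buffer is full, the event is counted as dropped rather than stalling the "interrupt" side. The class name, capacity, and items are all made up for illustration:

```perl
package RingBuffer;
use strict;
use warnings;

sub new {
    my ($class, $capacity) = @_;
    return bless { items => [], cap => $capacity, dropped => 0 }, $class;
}

# Fast path: called by the "ISR" (listener). Never blocks.
sub put {
    my ($self, $item) = @_;
    if (@{ $self->{items} } >= $self->{cap}) {
        $self->{dropped}++;     # full: count the loss, stay fast
        return 0;
    }
    push @{ $self->{items} }, $item;
    return 1;
}

# Slow path: called by the idle loop / worker at normal priority.
sub get {
    my ($self) = @_;
    return shift @{ $self->{items} };
}

sub dropped { $_[0]{dropped} }

package main;

my $rb = RingBuffer->new(3);
$rb->put($_) for 1 .. 5;        # capacity 3, so 2 events are dropped

my @drained;
while (defined(my $x = $rb->get)) { push @drained, $x }
print "drained=@drained dropped=", $rb->dropped, "\n";
# prints: drained=1 2 3 dropped=2
```

A real implementation would protect `put`/`get` with a lock or use a thread-safe queue, but the split between the minimal fast path and the normal-priority drain is the point of the sketch.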
by marioroy (Prior) on Jun 10, 2017 at 07:39 UTC
Hi talexb,

Regarding the MCE module, the main process enters a loop to handle IPC events while running. The following is an MCE::Hobo and MCE::Shared demonstration, based on the MCE::Flow solution. Here, MCE::Shared spawns a background process to handle IPC events, which allows the main process to listen for traps.
For comparison, the following provides a threads and Thread::Queue demonstration. Fortunately, one may run MCE::Shared alongside threads to get shared-handle support. Please note that this demonstration requires freezing and thawing at the application level; serialization is typically automatic for MCE and MCE::Shared solutions.
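The application-level freeze/thaw step might look like the following round trip with core Storable. The trap fields here are invented for illustration and do not reflect Net::SNMPTrapd's actual accessors:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# A hash like one a listener might build from a received trap
# (field names are hypothetical).
my $trap = {
    remoteaddr => '192.168.1.10',
    version    => 1,
    varbinds   => [ { '1.3.6.1.2.1.1.3.0' => 42 } ],
};

my $frozen = freeze($trap);      # flat scalar, safe to enqueue
my $copy   = thaw($frozen);      # rebuilt on the consumer side

print $copy->{remoteaddr}, "\n"; # prints: 192.168.1.10
```

With plain Thread::Queue the listener would enqueue `$frozen` and each consumer would `thaw` what it dequeues; MCE's queues do this serialization for you.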
On my Linux box, the MCE::Hobo and MCE::Shared demonstration completes 10k traps in 2.2 seconds. The threads and Thread::Queue demonstration unexpectedly needs more time, completing in 13.4 seconds. Notice the difference in the number of traps pending in the queue.
The following is the trap generator used to feed both demonstrations. To not impact the listener/consumer script, run this from another host.
Regarding IPC fetch requests, I've tried to get MCE and MCE::Shared on Linux closer to BSD levels. Threads are a mystery sometimes; I'm not sure why they run slowly under Red Hat / CentOS 7.3 with Perl v5.16.3.
Regards, Mario
Re: worker threads - one does all the work
by sundialsvc4 (Abbot) on Jun 08, 2017 at 20:39 UTC

I am skeptical that a multi-worker approach would in fact be beneficial if all of the data is coming in strictly from one TCP/IP port; the additional overhead of the approach that you suggest here might in fact just slow it down.

If the work that needs to be done upon receipt of any trap is “non-trivial,” then you might have one “listener” thread that does nothing more than toss the request into a queue for consumption by a second thread or pool of threads, leaving the listener free to receive the traps as quickly as they arrive without waiting for any of them to be processed. The latency of the system will be very consistent and very low, even under load.

Furthermore, there are several existing frameworks for building “all the necessary plumbing,” including the venerable POE and a variety of thread-safe queues. Everything you might need to set up worker pools and queues, and to manage the whole thing, is already available on CPAN, so you will not start from scratch.
I am skeptical that a multi-worker approach would in fact be beneficial if all of the data is coming in strictly from one TCP/IP port. I am skeptical that the additional overhead of the approach that you suggest here might in fact just slow it down. If the work that needs to be done upon receipt of any trap is “non-trivial,” then you might have one “listener” thread that does nothing more than toss the request into a queue for consumption by a second thread or pool of threads, leaving the listener free to process the traps as quickly as they arrive without waiting for any of them to be processed. The latency of the system will be very consistent and very low, even under load. Furthermore, there are several existing frameworks for building “all the necessary plumbing” – including the venerable POE and a variety of thread-safe queues. Everything you might need to set up worker-pools, queues, and to manage the whole thing are already available in CPAN so that you will not start from scratch. | |