in reply to Re^4: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
in thread IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?

Well that's pretty bizarre! So for some reason, attempting to read an incomplete SSL packet takes 3x as long as reading and decoding a packet? Your results point heavily toward a bad design inside libssl itself.

Just to be sure though, what if you make a quick change to your SSLeay.xs to verify that the problem is in SSL_read...

PREINIT:
    char *buf;
    int got;
    int succeeded = 1;
    struct timespec ts1, ts2;
    long dT;
PPCODE:
    New(0, buf, max, char);
    clock_gettime(CLOCK_MONOTONIC, &ts1);
    got = SSL_read(s, buf, max);
    clock_gettime(CLOCK_MONOTONIC, &ts2);
    dT = (ts2.tv_nsec - ts1.tv_nsec)/1000 + (ts2.tv_sec - ts1.tv_sec)*1000000;
    warn("SSL_read(s, buf, %ld) => %ld (%ld us)\n", (long)max, (long)got, dT);
    if (got <= 0 && SSL_ERROR_ZERO_RETURN != SSL_get_error(s, got))
        succeeded = 0;

This probably only works on Linux (I haven't checked whether MSWin32 has clock_gettime yet); if you aren't able to test, let me know and I will.

It will also completely throw off the CPU usage report, because it now prints on every read, but it should still measure the duration of the SSL_read calls. (The other option is to capture the timings in a persistent buffer and print them at the end, but that's a lot of hassle.)

If the timing of SSL_read matches your results above, then it would be time to head to the openssl project with a bug report. But if it doesn't, then maybe the slowness is inside SSL_get_error or something else in the surrounding code that only happens on an unproductive read.

And the last-resort option is to build against a debug-symbols version of libssl and go all the way down into it with gdb.
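
If patching SSLeay.xs isn't convenient, a cruder measurement can be taken from the Perl side with Time::HiRes. A minimal sketch (the helper name is made up; it times the whole Net::SSLeay::read() XS call, glue included, so it only brackets the problem rather than isolating SSL_read itself):

use strict;
use warnings;
use Time::HiRes qw(clock_gettime CLOCK_MONOTONIC);
use Net::SSLeay;

# Wrap Net::SSLeay::read() with a monotonic-clock timer and report the
# duration of every call, mirroring the warn() format of the XS patch above.
sub timed_read {
    my ($ssl, $max) = @_;
    my $t0  = clock_gettime(CLOCK_MONOTONIC);
    my $got = Net::SSLeay::read($ssl, $max);
    my $dt  = (clock_gettime(CLOCK_MONOTONIC) - $t0) * 1_000_000;   # microseconds
    warn sprintf("Net::SSLeay::read(ssl, %d) => %s (%.1f us)\n",
        $max, defined $got ? length($got) : 'undef', $dt);
    return $got;
}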


Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by Yaribz (Beadle) on Jun 26, 2022 at 16:49 UTC

    Ok so I did a lot of tests, placing debug timers in various xs and Perl code places and using NYTProf. I think there are two problems:

    1. the Perl code executed every time IO::Socket::SSL::sysread() is called is not lean enough to run tens of thousands of times per second or more (which mainly happens in non-blocking mode)
    2. too many SSLeay functions have to be called from Perl code for every unproductive read (Net::SSLeay::read(), plus a loop calling both Net::SSLeay::get_error() and Net::SSLeay::ERR_clear_error())

    I managed to work around the first problem by re-implementing the SSL read logic directly in my Perl code, without calling any IO::Socket::SSL function, just calling the SSLeay xs functions directly. This halved the average execution time of the unproductive reads. Performance still isn't satisfactory though, because of the second problem.
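
    A minimal sketch of that kind of direct read path (the helper name try_ssl_read is made up, and $ssl is assumed to be the raw Net::SSLeay handle of an already-established non-blocking connection):

    use strict;
    use warnings;
    use Net::SSLeay;

    # One non-blocking read attempt straight through Net::SSLeay, skipping
    # the IO::Socket::SSL Perl wrapper entirely.
    sub try_ssl_read {
        my ($ssl, $max) = @_;
        # In list context, read() also returns SSL_read()'s return code.
        my ($data, $rv) = Net::SSLeay::read($ssl, $max);
        return $data if defined $data && length $data;       # productive read
        my $err = Net::SSLeay::get_error($ssl, defined $rv ? $rv : -1);
        if ($err == Net::SSLeay::ERROR_WANT_READ()
         || $err == Net::SSLeay::ERROR_WANT_WRITE()) {
            Net::SSLeay::ERR_clear_error();
            return '';       # unproductive read: no complete SSL record yet
        }
        return undef;        # real error or shutdown (e.g. ERROR_ZERO_RETURN)
    }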

    I guess the only way to address the second problem would be to re-implement the IO::Socket::SSL::_generic_read() function (or maybe directly the IO::Socket::SSL::sysread() function?) in xs instead of Perl. It seems it isn't possible to call that many external functions from Perl, in a loop executed tens of thousands of times per second, without the overhead starting to show and hurting performance. If the main SSL read function, handling both the read and the error management (updating IO::Socket::SSL internals), were the only one that had to be called from Perl, I think we could get acceptable performance.

    Some profiling data before and after optimization:

      I think you make a pretty good case here for adding a new helper method to SSLeay that would offload some of the work of IO::Socket::SSL into XS. Hopefully the maintainers are intrigued by the problem. In the meantime, assuming you have work to get on with, your application should probably just use SSLeay directly :-) Maybe just have it assume that an undef read is an unproductive read for the first N times it happens, then check the error codes if it continues failing?
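
      A sketch of that heuristic (the threshold and names are made up; only after several consecutive failed reads does it pay for the error-code calls):

      my $MAX_SILENT_FAILURES = 20;
      my $fail_count = 0;

      sub read_assuming_unproductive {
          my ($ssl, $max) = @_;
          my $data = Net::SSLeay::read($ssl, $max);
          if (defined $data && length $data) {
              $fail_count = 0;
              return $data;                  # productive read
          }
          # Assume "no complete SSL record yet" for the first N failures,
          # skipping get_error()/ERR_clear_error() on the hot path.
          return '' if ++$fail_count <= $MAX_SILENT_FAILURES;
          $fail_count = 0;
          my $err = Net::SSLeay::get_error($ssl, -1);
          Net::SSLeay::ERR_clear_error();
          return '' if $err == Net::SSLeay::ERROR_WANT_READ();
          return undef;                      # persistent failure: treat as a real error
      }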

      I'd also like to point out that the flame graph looks like you could save about 10% overhead by calling $socket->sysread(...) instead of sysread($socket, ...), which is why I suggested that earlier. It's unfortunate that perl's global functions add that much overhead when operating on a user object, but it's a good incentive to move to OO-style syntax on handles.
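
      For reference, the two call styles being compared (arguments are otherwise identical):

      my $buf;
      sysread($socket, $buf, 16384);     # built-in function form
      $socket->sysread($buf, 16384);     # method form, resolves to IO::Socket::SSL::sysread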

        your application should probably just use SSLeay directly :-) Maybe just have it assume that an undef read is an unproductive read for the first N times it happens, then check the error codes if it continues failing?

        Yes, that's exactly what I ended up doing as the final optimization in my non-blocking SSL test client :) I was a bit afraid of a memory leak due to error data not being popped from the SSLeay error queue after errors occur, but apparently there is no leak. It's still not perfect though: the CPU usage remains much higher than in blocking mode, but at least I no longer have a core pegged at 100% during non-blocking 1 Gbps SSL transfers. Now I just need to integrate this code into my actual application and test it, which is another story...

        I'd also like to point out that the flame graph looks like you could save about 10% overhead by calling $socket->sysread(...) instead of sysread($socket, ...), which is why I suggested that earlier. It's unfortunate that perl's global functions add that much overhead when operating on a user object, but it's a good incentive to move to OO-style syntax on handles.

        That's true. However, in my case most of the CPU time was actually spent inside IO::Socket::SSL::sysread() itself, which is why I didn't notice a significant improvement from this change (I still had a CPU core at 100%).

Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by Yaribz (Beadle) on Jun 26, 2022 at 10:13 UTC
    Well that's pretty bizarre! So for some reason, attempting to read an incomplete SSL packet takes 3x as long as reading and decoding a packet? Your results point heavily toward a bad design inside libssl itself.

    Silly me, sorry, I guess I was really tired yesterday... I made a copy-paste mistake, so the "Avg productive sysread time" and "Avg unproductive sysread time" values were wrong for the non-blocking SSL samples. I have fixed them... It doesn't change the fact that the unproductive sysreads take more time than they should and, in my opinion, are responsible for the high CPU usage.

    The first SSL sample of the three I posted is interesting because it has about the same number of sysread calls per second as the non-SSL sample. It means that during this sample there was close to one (unproductive) sysread for each TCP packet received, as in the non-SSL case. This allows us to compare the time of the unproductive SSL sysread calls to that of the non-SSL sysread calls: 6.33 us versus 2.73 us, more than twice as long for a one-TCP-packet read.


    Just to be sure though, what if you make a quick change to your SSLeay.xs to verify that the problem is in SSL_read...

    I added your debug traces and let my script run for about a minute and a half, redirecting stderr to a file.

    First, of course, it affected the metrics produced by my script a bit. For example, I no longer see samples with a very high sysread call rate and a productive sysread ratio below 10%. Here is a typical sample (the most common kind) produced by my script when using your patched SSLeay binary:

    Interval: 1.00s
    Avg transfer speed: 117.55 MB/s
    Nb sysread calls: 52362
    Productive sysread size: 16384 B
    Productive sysread ratio: 13.70%
    Avg productive sysread time: 21.68 us
    Avg unproductive sysread time: 13.37 us
    (it looks like the second sample I posted previously, with similar sysread call rate and productive sysread ratio, except the timings are a bit higher, which makes sense I guess)

    And here is a sample of the output produced by your debug traces:

    SSL_read(s, buf, 16384) => 16384 (9 us)
    SSL_read(s, buf, 16384) => -1 (7 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => 16384 (10 us)
    SSL_read(s, buf, 16384) => -1 (6 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => -1 (6 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => 16384 (12 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => -1 (6 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => 16384 (12 us)
    SSL_read(s, buf, 16384) => -1 (6 us)
    SSL_read(s, buf, 16384) => -1 (3 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => 16384 (12 us)
    SSL_read(s, buf, 16384) => -1 (6 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (2 us)
    SSL_read(s, buf, 16384) => -1 (1 us)
    SSL_read(s, buf, 16384) => -1 (1 us)

    We can see there are about 6 unproductive reads for every productive read at the time the sample was taken. This is the most common pattern in my observations. I found some extreme cases like the ones below, but they are very rare:

    SSL_read(s, buf, 16384) => 16384 (12 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => -1 (92 us)
    SSL_read(s, buf, 16384) => 16384 (14 us)
    ...
    SSL_read(s, buf, 16384) => 16384 (9 us)
    SSL_read(s, buf, 16384) => -1 (5 us)
    SSL_read(s, buf, 16384) => -1 (85 us)
    SSL_read(s, buf, 16384) => 16384 (13 us)
    ...
    SSL_read(s, buf, 16384) => 16384 (18 us)
    SSL_read(s, buf, 16384) => -1 (75 us)
    SSL_read(s, buf, 16384) => 16384 (24 us)

    Some poor man's metrics regarding the recorded data:

    $ grep -c "=> 16384" debug_ssl2.txt
    613635
    $ grep -c "=> \-1" debug_ssl2.txt
    3779378
    613635 productive reads
    3779378 unproductive reads
    => ~13.97% productive sysread ratio

    That is consistent with what my script is reporting (phew!)

    What is NOT consistent, though, is the timing of my sysread calls compared to these SSL_read() timings, which would confirm that the problem doesn't come from libssl itself...

Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by Yaribz (Beadle) on Jun 26, 2022 at 12:03 UTC

    I wrote a small Perl script to parse your output; here are the results when processing the file:

    16384: total nb: 613635 total time: 6885255 avg time: 11.2204404898678
    -1: total nb: 3779378 total time: 10540192 avg time: 2.78886949122316

    11.22 us on average for the "productive" calls of the SSL_read() function from libssl
    2.79 us on average for the "unproductive" calls of the SSL_read() function from libssl

    The "unproductive" calls timing at libssl side seems pretty consistent with the timing I have for the sysread calls in the non-SSL case (2.73 us).