Re^3: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?

Well you seem to have explored most of those pretty well already. My point though about timing the productive vs. unproductive ssl reads was to get a feeling for the efficiency of the internals of the perl SSL implementation without (yet) diving down into the code.

In the best case, the implementation of IO::Socket::SSL's sysread would look something like this:

sub sysread(...) {
   if ($self->{decoded_remaining}) { return ... }
   else if (sysread($self->{underlying_socket}, $self->{buffer}, ...) 
+> 0) {
      $self->{next_frame_size} //= decode_frame_size($self->{buffer});
      if (length $self->{buffer} >= $self->{next_frame_size}) {
        ... # decode it
        return ...
      }
      $SSL_ERROR= ...;
      return undef;
   }
}
[download]

and that should run reasonably efficiently if you call it a bunch of times and the length() test is false. In fact, it shouldn't be too much worse than the same written in C, because there's just not much happening aside from a syscall and a few tests. In fact, I'd expect the syscall to be more than 75% of the execution time of that example code. Of course, long-lived code is seldom this simple and could have all sorts of extra checks going on making it way more expensive than just the syscall. It might be decoding the frame size on every call instead of caching it, or doing lots of other unnecessary work, or even making multiple syscalls for some reason. Have you tried systrace on it to see the actual calls? If you see unexpected things, then its time to dive into the internals of SSL and see if there is a way to avoid triggering that. You could end up reporting a bug that helps everyone.

On the other hand, just executing 40,000 syscalls even in a compiled language is expensive, and could possibly peg a CPU. By comparing the calls that return data (which execute the decode step) with the calls that don't, it could give a hint about that.

If perl is not adding excessive overhead, then you could be up against the limits of nonblocking reads to the TCP stack. You want there to be fewer sysreads, but the TCP stack is returning smaller fragments of data than are useful to process, and waking your rogram too often. In this case, you could implement your own variant of Nagle's algorithm, and just delay a number of milliseconds after the socket is readable before performing the read, thus collecting more data per read for better throughput, and not hurting the latency too much since you are doing bulk transfers and not tiny two-way signaling. Of course this assumption only works when you know the sender is writing large packets, otherwise you'd be delaying the processing of small packets. I'd suggest making the delay dynamic, in response to the number of unproductive reads. This way, the code can auto-tune itself to systems with vastly different network or processor speeds.

Comment on Re^3: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? Download Code

Replies are listed 'Best First'.
Re^4: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by Yaribz (Beadle) on Jun 25, 2022 at 11:19 UTC
Well you seem to have explored most of those pretty well already. My point though about timing the productive vs. unproductive ssl reads was to get a feeling for the efficiency of the internals of the perl SSL implementation without (yet) diving down into the code. Here are some metrics for various non-blocking test cases. SSL: (I provided 3 samples because the metrics can vary, but the most common sample type is the third one) Interval: 1.00s Avg transfer speed: 117.61 MB/s Nb sysread calls: 78837 Productive sysread size: 16384 B Productive sysread ratio: 9.10% Avg productive sysread time: 13.86 us Avg unproductive sysread time: 6.33 us Interval: 1.00s Avg transfer speed: 117.61 MB/s Nb sysread calls: 56617 Productive sysread size: 16384 B Productive sysread ratio: 12.68% Avg productive sysread time: 20.84 us Avg unproductive sysread time: 9.35 us Interval: 1.00s Avg transfer speed: 117.61 MB/s Nb sysread calls: 35037 Productive sysread size: 16384 B Productive sysread ratio: 20.49% Avg productive sysread time: 27.57 us Avg unproductive sysread time: 16.92 us [download] SSL with socket receive low water mark: (I provided 2 samples because sometimes SSL_WANT_READ happens during the interval, sometimes not) `Interval: 1.00s Avg transfer speed: 117.57 MB/s Nb sysread calls: 7205 Productive sysread size: 16384 B Productive sysread ratio: 99.60% Avg productive sysread time: 25.94 us Avg unproductive sysread time: 18.23 us Interval: 1.00s Avg transfer speed: 117.61 MB/s Nb sysread calls: 7179 Productive sysread size: 16384 B Productive sysread ratio: 100.00% Avg productive sysread time: 42.37 us` [download] Non-SSL: `Interval: 1.00s Avg transfer speed: 117.77 MB/s Nb sysread calls: 80621 Avg sysread size: 1460.78 B Avg sysread time: 2.73 us` [download] As expected, the unproductive SSL sysread calls take much more time than they should and are clearly the culprits (when socket receive low water mark is not used). In the best case, the implementation of IO::Socket::SSL's sysread would look something like this: [...] Unfortunately it doesn't work like that, the SSLeay library does all the job: IO::Socket::SSL::sysread() always calls IO::Socket::SSL::_generic_read(), which in turn always calls Net::SSLeay::read() On the other hand, just executing 40,000 syscalls even in a compiled language is expensive, and could possibly peg a CPU. It might be expensive, but the system can do at least twice this number of syscalls with no problem at all, as you can see in the non-SSL test metrics above. It's not the syscalls themselves which overload the CPU. If perl is not adding excessive overhead, then you could be up against the limits of nonblocking reads to the TCP stack. The TCP stack on this system can handle more than 80K non-blocking I/O by second from Perl, this is what happens when I run the test in non-SSL mode and all works perfectly fine in this case. Regarding the workaround which would consist in artificially delaying the sysread calls with an auto-tuning algorithm: I have already tested something similar and actually it brings a lot of new problems. For example, in order to not impact the transfer speed, the sleep durations need to be very short (especially in the case of high-bandwidth low-delay connection). Unfortunately the minimum sleep duration varies a lot depending on the OS and the hardware. For example, on Windows 10 I couldn't find ways to reliably sleep less than 15ms in Perl (). In my simple case of 1 Gbps SSL transfer, I would need reliable sleeps around 140 us. () Windows 10 minimum sleep duration tests: "usleep(999);" does nothing "usleep(1000);" sleeps for a random number of miliseconds between 2 and 15 "usleep(15000);" sleeps for about 15ms "select(undef,undef,undef,0.001);" sleeps for about 15ms "select(undef,undef,undef,0.000999);" sleeps for a random number of microseconds between 500 and 1000	[reply] [d/l] [select]
Re^5: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by NERDVANA (Priest) on Jun 26, 2022 at 03:56 UTC
Well that's pretty bizarre! So for some reason, attempting to read an incomplete SSL packet takes 3x as long as reading and decoding a packet? Your results point heavily toward a bad design inside libssl itself. Just to be sure though, what if you make a quick change to your SSLeay.xs to verify that the problem is in SSL_read... `PREINIT: char buf; int got; int succeeded = 1; struct timespec ts1, ts2; long dT; PPCODE: New(0, buf, max, char); clock_gettime(CLOCK_MONOTONIC, &ts1); got = SSL_read(s, buf, max); clock_gettime(CLOCK_MONOTONIC, &ts2); dT= (ts2.tv_nsec - ts1.tv_nsec)/1000 + (ts2.tv_sec - ts1.tv_sec)1000000; warn("SSL_read(s, buf, %ld) => %ld (%ld us)\n", (long)max, (long)got, dT); if (got <= 0 && SSL_ERROR_ZERO_RETURN != SSL_get_error(s, got) +) succeeded = 0;` [download] This probably only works on Linux (haven't checked if MSWin32 added clock_gettime yet) and if you aren't able to test, let me know and I will. It will also completely throw off the CPU usage report because now it prints on every read, but it should still be able to measure the duration of the SSL_read calls. (the other option is to capture the timings in a persistent buffer and print them at the end, but that's a lot of hassle.) If the timing of SSL_read matches your results above, then it would be time to head to the openssl project with a bug report. But, if it doesn't, then maybe the slowness is inside SSL_get_error or something else in the surrounding code that only happens on an unproductive read. And the other option is to build against a debug-symbols version of libssl, and go all the way down into it with gdb.	[reply] [d/l]
Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by Yaribz (Beadle) on Jun 26, 2022 at 16:49 UTC
Ok so I did a lot of tests, placing debug timers in various xs and Perl code places and using NYTProf. I think there are two problems: the Perl code which is executed every time the IO::Socket::SSL::sysread() function is called is not optimized well enough to be executed tens of thousands of times per second or more (mainly when non-blocking mode is used) there are too many SSLeay functions which need to be called from Perl code every time an unproductive read is performed (Net::SSLeay::read() + loop calling both Net::SSLeay::get_error() and Net::SSLeay::ERR_clear_error()) I managed to workaround the first problem by re-implementing the SSL read logic directly in my Perl code, without calling any IO::Socket::SSL function, just by calling the SSLeay xs functions directly. This allowed me to halve the average execution time of the unproductive reads. Performances still aren't satisfying though, due to the second problem. I guess the only way to address the second problem would be to re-implement the IO::Socket::SSL::_generic_read() function (or maybe directly the IO::Socket::SSL::sysread() function ?) in xs instead of Perl. It seems it's not possible to call that many external functions from Perl in a loop being executed tens of thousands times per second, without the overhead starting to show and impacting the performances. If the main SSL read function, handling both the read and the error management (updating IO::Socket::SSL internals), was the only one which had to be called from Perl, I think we could get acceptable performance. Some profiling data before and after optimization: Before optimization: Flame graph Top subroutines After optimization (bypassing IO::Socket::SSL): Flame graph Top subroutines	[reply]
Re^7: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by NERDVANA (Priest) on Jun 27, 2022 at 18:06 UTC
Re^8: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by Yaribz (Beadle) on Jun 27, 2022 at 20:06 UTC
Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by Yaribz (Beadle) on Jun 26, 2022 at 10:13 UTC
Well that's pretty bizarre! So for some reason, attempting to read an incomplete SSL packet takes 3x as long as reading and decoding a packet? Your results point heavily toward a bad design inside libssl itself. Silly me, sorry I guess I was really tired yesterday... I made a stupid copy-paste mistake so that the "Avg productive sysread time" and "Avg unproductive sysread time" values were wrong for the non-blocking SSL samples. I fixed them... Though it doesn't change the fact that the unproductive sysreads take more time than they should and are responsible for the high CPU usage in my opinion. The first SSL sample of the three I posted is interesting because there is about the same number of sysread calls by second as the non-SSL sample. It means that during this sample there was close to one (unproductive) sysread for each TCP packet received, as in the non-SSL case. This allows us to compare the time of the unproductive SSL sysread calls to that of the non-SSL sysread calls: 6.33 us versus 2.73 us, more than twice as long for a one-TCP-packet read. Just to be sure though, what if you make a quick change to your SSLeay.xs to verify that the problem is in SSL_read... I added your debug traces and I did let run my script for about one minute and a half, redirecting stderr to a file. First, of course it impacted a bit the metrics produced by my script. For example I no longer have samples with very high sysread call rate and productive sysread ratio < 10%. Here is an example of sample (the most common one) produced by my script when using your patched SSLeay binary: `Interval: 1.00s Avg transfer speed: 117.55 MB/s Nb sysread calls: 52362 Productive sysread size: 16384 B Productive sysread ratio: 13.70% Avg productive sysread time: 21.68 us Avg unproductive sysread time: 13.37 us` [download] (it looks like the second sample I posted previously, with similar sysread call rate and productive sysread ratio, except the timings are a bit higher, which makes sense I guess) And here is a sample of the output produced by your debug traces: SSL_read(s, buf, 16384) => 16384 (9 us) SSL_read(s, buf, 16384) => -1 (7 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => 16384 (10 us) SSL_read(s, buf, 16384) => -1 (6 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => -1 (6 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => 16384 (12 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => -1 (6 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => 16384 (12 us) SSL_read(s, buf, 16384) => -1 (6 us) SSL_read(s, buf, 16384) => -1 (3 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => 16384 (12 us) SSL_read(s, buf, 16384) => -1 (6 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (2 us) SSL_read(s, buf, 16384) => -1 (1 us) SSL_read(s, buf, 16384) => -1 (1 us) [download] We can see there is about 6 unproductive reads for 1 productive read at the time the sample was taken. This is the most common pattern from my observation. I found some extreme cases like these ones, but they are very rare: `SSL_read(s, buf, 16384) => 16384 (12 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => -1 (92 us) SSL_read(s, buf, 16384) => 16384 (14 us) ... SSL_read(s, buf, 16384) => 16384 (9 us) SSL_read(s, buf, 16384) => -1 (5 us) SSL_read(s, buf, 16384) => -1 (85 us) SSL_read(s, buf, 16384) => 16384 (13 us) ... SSL_read(s, buf, 16384) => 16384 (18 us) SSL_read(s, buf, 16384) => -1 (75 us) SSL_read(s, buf, 16384) => 16384 (24 us)` [download] Some poor man's metrics regarding the recorded data: `$ grep -c "=> 16384" debug_ssl2.txt 613635 $ grep -c "=> \-1" debug_ssl2.txt 3779378` [download] 61365 productive reads 3779378 unproductive reads => ~13.97% productive sysread ratio That is consistent with what my script is reporting (phew!) What is NOT consistent though, is the timing for the sysread, which would confirm the problem doesn't come from libssl itself...	[reply] [d/l] [select]
Re^6: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ? by Yaribz (Beadle) on Jun 26, 2022 at 12:03 UTC
I did a small Perl script to parse your output, here are the results when processing the file: `16384: total nb: 613635 total time: 6885255 avg time: 11.2204404898678 -1: total nb: 3779378 total time: 10540192 avg time: 2.78886949122316` [download] 11.22 us in average for the "productive" calls of the SSL_read() function from libssl 2.79 us in average for the "unproductive" calls of the SSL_read() function from libssl The "unproductive" calls timing at libssl side seems pretty consistent with the timing I have for the sysread calls in the non-SSL case (2.73 us).	[reply] [d/l]