in reply to Re: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
in thread IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?

A few aspects of your code bother me. I could investigate the details, but I apologize that I'm feeling lazy. But, maybe just asking these questions would send you on the right path to discovery?

Sure, thanks for trying to help!


Your code is expecting to read and write entire 16K chunks, and even throws an error if it doesn't get a full 16K chunk. This goes against the normal API of sysread, though of course in this case it is virtual. Does SSL protocol or this particular implementation guarantee whole frames like this? Is 16K a magic SSL number that you can rely on, or will it sometimes give you smaller chunks? Or does it guarantee that you receive the whole chunks of whatever the other side sent?

This is all explained in the second link of my original post (section "Polling of SSL sockets"). I provided these links precisely because IO::Socket::SSL has specific aspects like this one; I could try to paraphrase here, but I'm afraid I wouldn't be as accurate as the original explanation. The fact that the server side also uses this application buffer size plays a role here: it is what makes the sending SSL layer use the maximum SSL frame size. If you want, you can consider the "die" instruction a debug/check instruction. As you can see, the "die" is never executed in my case, so it is safe to assume the sysread calls on the client side always return 16 KiB of data in this specific test case. The "die 'Unexpected pending data in SSL socket'" instruction is exactly the same: a debug instruction to show this never happens in this test case.
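
To make this concrete, the client read loop under discussion looks roughly like this (a simplified sketch, not the exact code from my original post; $sock stands for the connected non-blocking IO::Socket::SSL handle):

use strict;
use warnings;
use IO::Socket::SSL;                      # exports $SSL_ERROR and SSL_WANT_READ
use Errno qw(EAGAIN EWOULDBLOCK);

use constant BUF_SIZE => 16 * 1024;       # 16 KiB, the maximum SSL record payload

my $rin = '';
vec($rin, fileno($sock), 1) = 1;
while (1) {
    select(my $rout = $rin, undef, undef, undef);       # wait for readability
    my $readBytes = sysread($sock, my $buffer, BUF_SIZE);
    if (defined $readBytes) {
        last if $readBytes == 0;                        # peer closed the connection
        die "Unexpected partial read ($readBytes B)"    # debug/check instruction,
            if $readBytes != BUF_SIZE;                  #   never triggers in this test
        die 'Unexpected pending data in SSL socket'     # debug/check instruction
            if $sock->pending();
        # process the 16 KiB of application data in $buffer here
    }
    elsif ($! != EAGAIN && $! != EWOULDBLOCK && $SSL_ERROR != SSL_WANT_READ) {
        die "sysread error: $!";
    }
    # otherwise: an "unproductive" sysread; go back to select() and retry
}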


Packets have to be smaller than 1.5K for ethernet, so many physical receives will happen before you reach 16K, as you noticed in your subsequent post. Might your choice of 16K be the actual problem that causes lots of looping?

I'm not sure I understand your reasoning. Can you tell me which size you think would be better to use as the sysread parameter? Then I will try to explain why I think it wouldn't improve anything.
Also, as you can see in my previous post, Mojo::IOLoop suffers from exactly the same problem (a CPU core is used at 100% when performing a transfer), even though it uses a hardcoded value of 128 KiB for the application buffer size.


You mention configuring the "low water mark" on the socket, but that seems wrong. Your application packets are 16K, but the underlying TCP packets could be a variety of sizes, so insisting at the OS level that it not return data until 16K is available seems likely to break things (even if it didn't this time).

For the general case, of course, it is wrong; SO_RCVLOWAT comes with its own challenges when you try to use it in the general case. I only used SO_RCVLOWAT to check that the origin of the problem really was what I thought it was, in my specific test case (a large data transfer over a non-blocking SSL socket).
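
For reference, setting the receive low water mark to one full SSL frame amounts to something like this (an illustrative sketch; $client stands for the client socket of my test case, and SO_RCVLOWAT is not available or settable on every platform):

use strict;
use warnings;
use Socket qw(SOL_SOCKET SO_RCVLOWAT);

# Ask the kernel not to report the socket as readable until a full
# 16 KiB SSL record is buffered; Perl packs the integer value itself.
setsockopt($client, SOL_SOCKET, SO_RCVLOWAT, 16384)
    or die "setsockopt(SO_RCVLOWAT) failed: $!";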


Using Time::HiRes, have you tried measuring the actual overhead of an "unproductive" sysread on IO::Socket::SSL which reads the underlying socket but delivers nothing? How does this compare to the time spent on a productive sysread? It seems to me that an unproductive sysread shouldn't take much longer than a sysread on the underlying socket.

It's not that simple. An "unproductive" sysread from the application's point of view is actually productive from the SSL layer's point of view: data is copied from the socket buffer to the internal SSL buffer. So each "unproductive" sysread call actually makes the later "productive" sysread call lighter, because part of the SSL frame is already inside the internal SSL buffer. The only way to measure the true overhead is to compare a productive sysread call with no preceding unproductive sysread calls to a productive sysread call with N preceding unproductive sysread calls.
But anyway, I wouldn't say there is an overhead in "unproductive" SSL sysreads compared to "productive" SSL sysreads; I would say there is an overhead in any SSL sysread (even the "unproductive" ones, which don't decrypt anything yet, and that is the main problem in my opinion) compared to a non-SSL sysread. That would explain why, in the case of a large data transfer using SSL, it's important to avoid superfluous sysread calls and not try to read before an entire SSL frame is available (or at least not after every single TCP packet is received).
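
As a sketch of how this can be instrumented (not the exact code from my test script), each non-blocking sysread can be wrapped with Time::HiRes and the timings bucketed by whether the call delivered application data or returned SSL_WANT_READ:

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use IO::Socket::SSL;            # exports $SSL_ERROR and SSL_WANT_READ

my (@productive_us, @unproductive_us);

# Wrap one non-blocking sysread and record how long it took, split by outcome.
sub timed_sysread {
    my ($sock, $size) = @_;
    my $t0 = [gettimeofday];
    my $readBytes = sysread($sock, my $buffer, $size);
    my $elapsed_us = tv_interval($t0) * 1_000_000;
    if (defined $readBytes && $readBytes > 0) {
        push @productive_us, $elapsed_us;
    }
    elsif (!defined $readBytes && $SSL_ERROR == SSL_WANT_READ) {
        push @unproductive_us, $elapsed_us;
    }
    return ($readBytes, $buffer);
}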


If you call ->sysread as a class method instead of a tied-IO global function, is there any performance difference?

No; in both cases I get about 40,000 sysread calls per second, and a CPU core is used at 100% instead of 55%.


Have you verified that the code really does wait on select()? If fileno() didn't work as expected and the select() always returned immediately, it would result in really high CPU usage with no other hints that something was wrong.

Yes, I already checked that by adding a sleep in the send code on the server side and checking the resulting number of sysread calls on the client side; all works as expected. Also, the simple fact that using SO_RCVLOWAT instantly "fixes" the problem without touching the select code is a good indicator that the select call is working as expected.

Re^3: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by NERDVANA (Priest) on Jun 24, 2022 at 18:32 UTC
    Well you seem to have explored most of those pretty well already. My point though about timing the productive vs. unproductive ssl reads was to get a feeling for the efficiency of the internals of the perl SSL implementation without (yet) diving down into the code.

    In the best case, the implementation of IO::Socket::SSL's sysread would look something like this:

    sub sysread(...) {
        if ($self->{decoded_remaining}) {
            return ...
        }
        elsif (sysread($self->{underlying_socket}, $self->{buffer}, ...) > 0) {
            $self->{next_frame_size} //= decode_frame_size($self->{buffer});
            if (length $self->{buffer} >= $self->{next_frame_size}) {
                ... # decode it
                return ...
            }
            $SSL_ERROR = ...;
            return undef;
        }
    }

    and that should run reasonably efficiently if you call it a bunch of times and the length() test is false. In fact, it shouldn't be much worse than the same thing written in C, because there's just not much happening aside from a syscall and a few tests; I'd expect the syscall to be more than 75% of the execution time of that example code. Of course, long-lived code is seldom this simple and could have all sorts of extra checks going on, making it way more expensive than just the syscall. It might be decoding the frame size on every call instead of caching it, doing lots of other unnecessary work, or even making multiple syscalls for some reason. Have you tried strace on it to see the actual calls? If you see unexpected things, then it's time to dive into the internals of SSL and see if there is a way to avoid triggering that. You could end up reporting a bug that helps everyone.

    On the other hand, just executing 40,000 syscalls per second, even in a compiled language, is expensive, and could possibly peg a CPU. Comparing the calls that return data (which execute the decode step) with the calls that don't would give a hint about that.

    If perl is not adding excessive overhead, then you could be up against the limits of nonblocking reads to the TCP stack. You want there to be fewer sysreads, but the TCP stack is returning smaller fragments of data than are useful to process, and waking your program too often. In this case, you could implement your own variant of Nagle's algorithm, and just delay a number of milliseconds after the socket is readable before performing the read, thus collecting more data per read for better throughput, and not hurting the latency too much since you are doing bulk transfers and not tiny two-way signaling. Of course this assumption only works when you know the sender is writing large packets; otherwise you'd be delaying the processing of small packets. I'd suggest making the delay dynamic, in response to the number of unproductive reads. This way, the code can auto-tune itself to systems with vastly different network or processor speeds.
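
    A rough sketch of that auto-tuning idea (the names and thresholds here are invented for illustration, not taken from any existing module):

    use strict;
    use warnings;
    use Time::HiRes qw(usleep);

    my $delay_us = 0;                       # extra wait after select() says "readable"
    use constant MAX_DELAY_US => 2000;

    sub read_with_adaptive_delay {
        my ($sock, $size) = @_;
        usleep($delay_us) if $delay_us;
        my $readBytes = sysread($sock, my $buffer, $size);
        if (defined $readBytes && $readBytes > 0) {
            $delay_us = int($delay_us / 2);                  # productive: back off the delay
        } else {
            $delay_us = $delay_us ? $delay_us * 2 : 100;     # unproductive: wait longer next time
            $delay_us = MAX_DELAY_US if $delay_us > MAX_DELAY_US;
        }
        return ($readBytes, $buffer);
    }

    The doubling and halving factors are arbitrary; the point is only that the delay grows while reads come back empty and collapses as soon as they deliver data, so the loop adapts to the actual arrival rate.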

      Well you seem to have explored most of those pretty well already. My point though about timing the productive vs. unproductive ssl reads was to get a feeling for the efficiency of the internals of the perl SSL implementation without (yet) diving down into the code.

      Here are some metrics for various non-blocking test cases.

      SSL:
      (I provided 3 samples because the metrics can vary, but the most common sample type is the third one)

      Interval: 1.00s
      Avg transfer speed: 117.61 MB/s
      Nb sysread calls: 78837
      Productive sysread size: 16384 B
      Productive sysread ratio: 9.10%
      Avg productive sysread time: 13.86 us
      Avg unproductive sysread time: 6.33 us

      Interval: 1.00s
      Avg transfer speed: 117.61 MB/s
      Nb sysread calls: 56617
      Productive sysread size: 16384 B
      Productive sysread ratio: 12.68%
      Avg productive sysread time: 20.84 us
      Avg unproductive sysread time: 9.35 us

      Interval: 1.00s
      Avg transfer speed: 117.61 MB/s
      Nb sysread calls: 35037
      Productive sysread size: 16384 B
      Productive sysread ratio: 20.49%
      Avg productive sysread time: 27.57 us
      Avg unproductive sysread time: 16.92 us

      SSL with socket receive low water mark:
      (I provided 2 samples because sometimes SSL_WANT_READ happens during the interval, sometimes not)

      Interval: 1.00s
      Avg transfer speed: 117.57 MB/s
      Nb sysread calls: 7205
      Productive sysread size: 16384 B
      Productive sysread ratio: 99.60%
      Avg productive sysread time: 25.94 us
      Avg unproductive sysread time: 18.23 us

      Interval: 1.00s
      Avg transfer speed: 117.61 MB/s
      Nb sysread calls: 7179
      Productive sysread size: 16384 B
      Productive sysread ratio: 100.00%
      Avg productive sysread time: 42.37 us

      Non-SSL:

      Interval: 1.00s
      Avg transfer speed: 117.77 MB/s
      Nb sysread calls: 80621
      Avg sysread size: 1460.78 B
      Avg sysread time: 2.73 us

      As expected, the unproductive SSL sysread calls take much more time than they should and are clearly the culprits (when socket receive low water mark is not used).
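
      (Rough arithmetic on the first SSL sample above: about 90.9% of the 78837 calls, i.e. roughly 71,600 unproductive sysread calls at ~6.33 us each, already adds up to about 0.45 s of CPU time per 1.00 s interval, before counting the productive reads.)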


      In the best case, the implementation of IO::Socket::SSL's sysread would look something like this:
      [...]

      Unfortunately it doesn't work like that; the SSLeay library does all the work: IO::Socket::SSL::sysread() always calls IO::Socket::SSL::_generic_read(), which in turn always calls Net::SSLeay::read().


      On the other hand, just executing 40,000 syscalls even in a compiled language is expensive, and could possibly peg a CPU.

      It might be expensive, but the system can do at least twice that number of syscalls with no problem at all, as you can see in the non-SSL test metrics above. It's not the syscalls themselves that overload the CPU.


      If perl is not adding excessive overhead, then you could be up against the limits of nonblocking reads to the TCP stack.

      The TCP stack on this system can handle more than 80K non-blocking I/O operations per second from Perl; this is what happens when I run the test in non-SSL mode, and everything works perfectly fine in that case.

      Regarding the workaround of artificially delaying the sysread calls with an auto-tuning algorithm: I have already tested something similar, and it actually brings a lot of new problems. For example, in order not to impact the transfer speed, the sleep durations need to be very short (especially in the case of a high-bandwidth, low-delay connection). Unfortunately, the minimum sleep duration varies a lot depending on the OS and the hardware. For example, on Windows 10 I couldn't find a way to reliably sleep for less than 15 ms in Perl (*). In my simple case of a 1 Gbps SSL transfer, I would need reliable sleeps of around 140 us.

      (*) Windows 10 minimum sleep duration tests:
      "usleep(999);" does nothing
      "usleep(1000);" sleeps for a random number of miliseconds between 2 and 15
      "usleep(15000);" sleeps for about 15ms
      "select(undef,undef,undef,0.001);" sleeps for about 15ms
      "select(undef,undef,undef,0.000999);" sleeps for a random number of microseconds between 500 and 1000

        Well that's pretty bizarre! So for some reason, attempting to read an incomplete SSL packet takes 3x as long as reading and decoding a packet? Your results point heavily toward a bad design inside libssl itself.

        Just to be sure though, what if you make a quick change to your SSLeay.xs to verify that the problem is in SSL_read...

        PREINIT:
            char *buf;
            int got;
            int succeeded = 1;
            struct timespec ts1, ts2;
            long dT;
        PPCODE:
            New(0, buf, max, char);
            clock_gettime(CLOCK_MONOTONIC, &ts1);
            got = SSL_read(s, buf, max);
            clock_gettime(CLOCK_MONOTONIC, &ts2);
            dT = (ts2.tv_nsec - ts1.tv_nsec)/1000 + (ts2.tv_sec - ts1.tv_sec)*1000000;
            warn("SSL_read(s, buf, %ld) => %ld (%ld us)\n", (long)max, (long)got, dT);
            if (got <= 0 && SSL_ERROR_ZERO_RETURN != SSL_get_error(s, got))
                succeeded = 0;

        This probably only works on Linux (I haven't checked whether MSWin32 has added clock_gettime yet), and if you aren't able to test it, let me know and I will.

        It will also completely throw off the CPU usage report because now it prints on every read, but it should still be able to measure the duration of the SSL_read calls. (the other option is to capture the timings in a persistent buffer and print them at the end, but that's a lot of hassle.)

        If the timing of SSL_read matches your results above, then it would be time to head to the openssl project with a bug report. But, if it doesn't, then maybe the slowness is inside SSL_get_error or something else in the surrounding code that only happens on an unproductive read.

        And the other option is to build against a debug-symbols version of libssl, and go all the way down into it with gdb.