
IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?

by Yaribz (Beadle)
on Jun 23, 2022 at 16:22 UTC

Yaribz has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm using the IO::Socket::SSL module in non-blocking mode and I noticed high CPU usage during data transfers (much higher than in blocking mode, at similar speeds). After some investigation, I think this is caused by a very high proportion of failed/empty sysreads returning SSL_WANT_READ.

I based my code on the following links, which explain how to use SSL sockets in non-blocking mode and how to handle SSL_WANT_READ / SSL_WANT_WRITE:

According to this documentation, a read operation performed on an SSL socket can return nothing even if the socket was flagged as "readable" by select just before. This happens when the data available in the socket buffer were only SSL transport-layer data, not actual application data. In that case, one just needs to select the socket again and retry the read operation. This makes perfect sense, and as I understand it, these empty reads should only represent a small part of all read operations performed on the SSL socket during large data transfers.
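
For illustration, here is a condensed sketch of that pattern (the actual test scripts follow below; $ssl_sock and process() are just placeholders):

use IO::Socket::SSL;   # exports $SSL_ERROR, SSL_WANT_READ, SSL_WANT_WRITE

# $ssl_sock: an already connected IO::Socket::SSL handle, switched to
# non-blocking mode with $ssl_sock->blocking(0)
my $rin = '';
vec($rin, fileno($ssl_sock), 1) = 1;
while(1) {
  # wait until the underlying socket is readable
  select(my $rout = $rin, undef, undef, undef) > 0 or die "select: $!";
  my $n = sysread($ssl_sock, my $buf, 16384);
  if(defined $n) {
    last unless $n;            # 0 => connection closed by peer
    process($buf);             # placeholder for the application logic
  }elsif($SSL_ERROR == SSL_WANT_READ) {
    next;                      # only part of a TLS record arrived: select again
  }elsif($SSL_ERROR == SSL_WANT_WRITE) {
    # renegotiation: wait until the socket is writable, then retry the read
    select(undef, my $wout = $rin, undef, undef) > 0 or die "select: $!";
  }else{
    die "sysread failed: $!, $SSL_ERROR";
  }
}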

However, in my case (when performing large transfers between two hosts on a 1 Gbps network), these "empty" reads due to SSL_WANT_READ represent about 80% of all the read operations performed on the socket, which likely explains the high CPU usage. So I guess I'm doing something wrong in my application...

I managed to reproduce the problem with these two small scripts:

Here is the server code, which runs in blocking mode and sends data continuously to any client that connects to it (only one client at a time). It works fine, with low CPU usage, and is only provided to help reproduce the problem when using a non-blocking socket on the client side:

use warnings;
use strict;
use IO::Socket::SSL;

use constant { BUFSIZE => 16384 };
die 'Invalid BUFSIZE value (must be a multiple of 16)' if(BUFSIZE%16);

my $LOCAL_ADDR='0.0.0.0';
my $LOCAL_PORT=1234;

my $BUFFER = '0123456789ABCDEF' x (BUFSIZE/16);

my $srvSock;
$srvSock=IO::Socket::SSL->new(
  LocalHost => $LOCAL_ADDR,
  LocalPort => $LOCAL_PORT,
  ReuseAddr => 1,
  Listen => 1,
  SSL_cert_file => 'cert.pem',
  SSL_key_file => 'key.pem',
) or die "Failed to create listening SSL socket: $!";

print "SSL server listening on $LOCAL_ADDR:$LOCAL_PORT\n";

while() {
  print "Waiting for client to connect...\n";
  my $clientSock = $srvSock->accept()
      or die "Failed to accept SSL client connection: $!, $SSL_ERROR\n";
  print "Client connected, sending data.\n";
  while() {
    my $writeLength=syswrite($clientSock,$BUFFER)
        or do {
          print " Failed to write to client: $!, $SSL_ERROR\n";
          last;
        };
    if($writeLength != BUFSIZE) {
      print " Unexpected write length: $writeLength\n";
      last;
    }
  }
}

And here is the client code, which runs in non-blocking mode and reports, every second, the transfer speed and the proportion of read operations that "failed" with SSL_WANT_READ or SSL_WANT_WRITE:

use warnings;
use strict;
use IO::Socket::SSL;
use Time::HiRes qw'time';

use constant {
  BUFSIZE => 16384,
  INTERVAL => 1,
};

my $PEER_ADDR='192.168.1.10';
my $PEER_PORT=1234;

print "Connecting to SSL server ($PEER_ADDR:$PEER_PORT)\n";
my $clientSock = IO::Socket::SSL->new(
  PeerHost => $PEER_ADDR,
  PeerPort => $PEER_PORT,
  Proto => 'tcp',
  SSL_verify_mode => SSL_VERIFY_NONE,
) or die "Failed to connect to SSL server: $!, $SSL_ERROR";

print "Connected, switching to non-blocking mode.\n";
$clientSock->blocking(0);

my $sockVec='';
vec($sockVec,fileno($clientSock),1)=1;

my ($nbRead,$nbSslWantRead,$nbSslWantWrite,$transferred)=(0,0,0,0);

print "Downloading data from server...\n";
my $currentTime=time();
my ($intvlStart,$intvlEnd)=($currentTime,$currentTime+INTERVAL);
while() {
  # There should be no pending data when BUFSIZE = max SSL frame size
  die 'Unexpected pending data in SSL socket' if($clientSock->pending());
  $!=0;
  select(my $readyVec=$sockVec,undef,undef,undef) > 0
      or die "Error in select for read: $!";
  my $readLength=sysread($clientSock,my $readData,BUFSIZE);
  $nbRead++;
  if(defined $readLength) {
    die "Connection closed by peer" unless($readLength);
    die "Unexpected read length: $readLength" unless($readLength == BUFSIZE);
  }else{
    die "Failed to read from SSL socket: $!" unless($!{EWOULDBLOCK} || $!{EAGAIN});
    if($SSL_ERROR == SSL_WANT_READ) {
      $nbSslWantRead++;
      next;
    }
    if($SSL_ERROR == SSL_WANT_WRITE) {
      $nbSslWantWrite++;
      select(undef,my $readyVec=$sockVec,undef,undef) > 0
          or die "Error in select for write: $!";
      next;
    }
    die 'Unexpected WOULDBLOCK/EAGAIN status when trying to read';
  }
  $transferred+=BUFSIZE;
  $currentTime=time();
  if($currentTime >= $intvlEnd) {
    printReport();
    initNewInterval();
  }
}

sub printReport {
  my $speed=formatSize($transferred/($currentTime-$intvlStart));
  my $pctSslWantRead=sprintf('%.2f',$nbSslWantRead*100/$nbRead);
  my $pctSslWantWrite=sprintf('%.2f',$nbSslWantWrite*100/$nbRead);
  print "Transfer speed: $speed/s\n";
  print " Read failure due to SSL_WANT_READ: $pctSslWantRead%\n";
  print " Read failure due to SSL_WANT_WRITE: $pctSslWantWrite%\n";
}

sub formatSize {
  my $size=shift;
  my @UNITS=('',qw'K M G T');
  my $unitIdx=0;
  while($size >= 1000 && $unitIdx < $#UNITS) {
    $size/=1000;
    $unitIdx++;
  }
  $size=sprintf('%.2f',$size) if(index($size,'.') != -1);
  return $size.' '.$UNITS[$unitIdx].'B';
}

sub initNewInterval {
  ($nbRead,$nbSslWantRead,$nbSslWantWrite,$transferred)=(0,0,0,0);
  $intvlStart=$currentTime;
  $intvlEnd+=INTERVAL while($intvlEnd < $intvlStart+INTERVAL/2);
}

To use these scripts, an SSL certificate must be placed in the server's current directory (files "cert.pem" and "key.pem"), and the $PEER_ADDR variable in the client script must be set to the server's address. The default port is 1234 and can be changed by editing the $LOCAL_PORT / $PEER_PORT variables.

Initially I tried to reproduce the problem by running both scripts (server and client) on the same system, using localhost, but the problem doesn't appear in this case (the proportion of failed reads due to SSL_WANT_READ is 0.00%).

Here are the results when I run the client and the server parts on distinct systems, for 10 seconds:

$ perl sslclinb.pl
Connecting to SSL server (192.168.1.10:1234)
Connected, switching to non-blocking mode.
Downloading data from server...
Transfer speed: 117.55 MB/s
 Read failure due to SSL_WANT_READ: 87.46%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 90.84%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 90.84%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 90.88%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 90.50%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 82.52%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 82.63%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 82.58%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 82.58%
 Read failure due to SSL_WANT_WRITE: 0.00%
Transfer speed: 117.61 MB/s
 Read failure due to SSL_WANT_READ: 82.57%
 Read failure due to SSL_WANT_WRITE: 0.00%
^C

I get similar results on two different systems (i.e. when swapping the server and client roles), on both Windows and Linux.
I would be very interested to know whether anyone can reproduce this behavior, and whether anyone has an idea of what is going on.

Thanks !


Re: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by haukex (Archbishop) on Jun 23, 2022 at 17:40 UTC

    I unfortunately can't comment on the low-level issue you're experiencing here; all I can say is that nowadays I would not implement this kind of thing at such a low level, and would use an event loop framework like POE or Mojo::IOLoop instead. I showed an example of a simple TCP client/server here; you should be able to extend that by adding the appropriate tls options. Perhaps using such a ready-made, higher-level framework will help you avoid this low-level issue.

      Thanks for your input; however, I can replicate the problem using a client based on Mojo::IOLoop, like this one for example:

      use warnings;
      use strict;
      use IO::Socket::SSL 'SSL_VERIFY_NONE';
      use Mojo::IOLoop;
      use Time::HiRes qw'time';

      use constant {
        BUFSIZE => 16384,
        INTERVAL => 1,
      };

      my $PEER_ADDR='192.168.1.10';
      my $PEER_PORT=1234;

      my $transferred=0;
      my ($currentTime,$intvlStart,$intvlEnd);

      Mojo::IOLoop->client(
        {
          address => $PEER_ADDR,
          port => $PEER_PORT,
          tls => 1,
          tls_options => {SSL_verify_mode => SSL_VERIFY_NONE},
          tls_verify => SSL_VERIFY_NONE
        },
        sub {
          my ($loop, $err, $stream) = @_;
          if(defined $err) {
            print "Failed to connect to SSL server: $err\n";
            return;
          }
          print "Connected, downloading data from server...\n";
          $currentTime=time();
          ($intvlStart,$intvlEnd)=($currentTime,$currentTime+INTERVAL);
          $stream->on(close => sub {die "Connection closed by peer"});
          $stream->on(error => sub { die "Failed to read from SSL socket: $_[1]" });
          $stream->on(read => sub {
            my ($stream,$readData)=@_;
            my $readLength=length($readData);
            die "Unexpected read length: $readLength" unless($readLength == BUFSIZE);
            $transferred+=BUFSIZE;
            $currentTime=time();
            if($currentTime >= $intvlEnd) {
              printReport();
              initNewInterval();
            }
          });
        });

      sub printReport {
        my $speed=formatSize($transferred/($currentTime-$intvlStart));
        print "Transfer speed: $speed/s\n";
      }

      sub formatSize {
        my $size=shift;
        my @UNITS=('',qw'K M G T');
        my $unitIdx=0;
        while($size >= 1000 && $unitIdx < $#UNITS) {
          $size/=1000;
          $unitIdx++;
        }
        $size=sprintf('%.2f',$size) if(index($size,'.') != -1);
        return $size.' '.$UNITS[$unitIdx].'B';
      }

      sub initNewInterval {
        $transferred=0;
        $intvlStart=$currentTime;
        $intvlEnd+=INTERVAL while($intvlEnd < $intvlStart+INTERVAL/2);
      }

      print "Connecting to SSL server ($PEER_ADDR:$PEER_PORT)\n";
      Mojo::IOLoop->start();

      When I run the test with this Mojo-based non-blocking SSL client, one CPU core is used at 100% (~40% us, ~10% sy, ~50% si), exactly as with my "low-level" non-blocking SSL client implementation. The drawback is that with this Mojo-based client I don't have access to the number of sysread calls that failed with SSL_WANT_READ, so I have less information to provide for debugging.

      For the record, if I use a blocking SSL client (still in Perl) instead, the highest CPU core usage during a 1 Gbps transfer is 55% on this system...
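
      A minimal blocking client of this kind could look like the following sketch (for illustration only; it reuses the same test parameters as above and is a reconstruction, not the exact script used for the measurement):

      use warnings;
      use strict;
      use IO::Socket::SSL;   # exports SSL_VERIFY_NONE and $SSL_ERROR

      my $sock = IO::Socket::SSL->new(
        PeerHost => '192.168.1.10',   # same test server as above
        PeerPort => 1234,
        SSL_verify_mode => SSL_VERIFY_NONE,
      ) or die "Failed to connect to SSL server: $!, $SSL_ERROR";

      my $transferred = 0;
      # blocking sysread: returns as soon as some decrypted application data is available
      while(my $n = sysread($sock, my $buf, 16384)) {
        $transferred += $n;
      }
      print "Transferred $transferred bytes\n";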

Re: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by NERDVANA (Deacon) on Jun 24, 2022 at 01:43 UTC
    A few aspects of your code bother me. I could investigate the details, but I apologize that I'm feeling lazy. But, maybe just asking these questions would send you on the right path to discovery?

    Your code is expecting to read and write entire 16K chunks, and even throws an error if it doesn't get a full 16K chunk. This goes against the normal API of sysread, though of course in this case it is virtual. Does SSL protocol or this particular implementation guarantee whole frames like this? Is 16K a magic SSL number that you can rely on, or will it sometimes give you smaller chunks? Or does it guarantee that you receive the whole chunks of whatever the other side sent?

    Packets have to be smaller than 1.5K for ethernet, so many physical receives will happen before you reach 16K, as you noticed in your subsequent post. Might your choice of 16K be the actual problem that causes lots of looping? You mention configuring the "low water mark" on the socket, but that seems wrong. Your application packets are 16K, but the underlying TCP packets could be a variety of sizes, so insisting at the OS level that it not return data until 16K is available seems likely to break things (even if it didn't this time).

    Using Time::HiRes, have you tried measuring the actual overhead of an "unproductive" sysread on IO::Socket::SSL which reads the underlying socket but delivers nothing? How does this compare to the time spent on a productive sysread? It seems to me that an unproductive sysread shouldn't take much longer than a sysread on the underlying socket. If you call ->sysread as a method instead of using the tied-IO global function, is there any performance difference?
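
    For illustration, such a measurement could be collected by bracketing each sysread with Time::HiRes timestamps and accumulating the productive and unproductive cases separately, e.g. (a sketch reusing $clientSock and BUFSIZE from the client script above; the report loop is only an example):

    use Time::HiRes qw(time);

    my (%count, %total);                 # keyed by 'productive' / 'unproductive'

    # around each read attempt:
    my $t0 = time();
    my $n  = sysread($clientSock, my $buf, BUFSIZE);
    my $dt = time() - $t0;
    my $kind = (defined $n && $n > 0) ? 'productive' : 'unproductive';
    $count{$kind}++;
    $total{$kind} += $dt;

    # once per reporting interval:
    for my $kind (sort keys %count) {
      printf("  %-12s sysread: %d calls, %.1f us on average\n",
             $kind, $count{$kind}, 1e6*$total{$kind}/$count{$kind});
    }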

    Have you verified that the code really does wait on select()? If fileno() didn't work as expected and the select() always returned immediately, it would result in really high CPU usage with no other hints that something was wrong.

      A few aspects of your code bother me. I could investigate the details, but I apologize that I'm feeling lazy. But, maybe just asking these questions would send you on the right path to discovery?

      Sure, thanks for trying to help !


      Your code is expecting to read and write entire 16K chunks, and even throws an error if it doesn't get a full 16K chunk. This goes against the normal API of sysread, though of course in this case it is virtual. Does SSL protocol or this particular implementation guarantee whole frames like this? Is 16K a magic SSL number that you can rely on, or will it sometimes give you smaller chunks? Or does it guarantee that you receive the whole chunks of whatever the other side sent?

      This is all explained in the second link of my original post (section "Polling of SSL sockets"). I provided these links precisely because IO::Socket::SSL has some specific aspects like this one. I could try to paraphrase here, but I'm afraid I wouldn't be as accurate as the original explanation. The fact that the server uses this application buffer size also plays a role here: it is what makes the sending SSL layer use the maximum SSL frame size. If you want, you can consider the "die" instruction as a debug/check instruction. As you can see, the "die" is never executed in my case, so it is safe to assume the sysread calls on the client side always return 16 KiB of data in this specific test case. The "die 'Unexpected pending data in SSL socket'" instruction is exactly the same: a debug instruction to show this never happens in this test case.


      Packets have to be smaller than 1.5K for ethernet, so many physical receives will happen before you reach 16K, as you noticed in your subsequent post. Might your choice of 16K be the actual problem that causes lots of looping?

      I'm not sure I understand your reasoning. Can you tell me which size you think would be better to use as the sysread length parameter, and then I will try to explain why I think it wouldn't improve anything?
      Also, as you can see in my previous post, Mojo::IOLoop suffers from the exact same problem (a CPU core is used at 100% when performing a transfer), whereas it uses a hardcoded value of 128 KiB for the application buffer size.


      You mention configuring the "low water mark" on the socket, but that seems wrong. Your application packets are 16K, but the underlying TCP packets could be a variety of sizes, so insisting at the OS level that it not return data until 16K is available seems likely to break things (even if it didn't this time).

      In the general case it is wrong, of course; SO_RCVLOWAT comes with its own challenges when you try to use it generally. I only used SO_RCVLOWAT to check that the origin of the problem really was what I thought it was, in my specific test case (a large data transfer on a non-blocking SSL socket).


      Using Time::HiRes, have you tried measuring the actual overhead of an "unproductive" sysread on IO::Socket::SSL which reads the underlying socket but delivers nothing? How does this compare to the time spent on a productive sysread? It seems to me that an unproductive sysread shouldn't take much longer than a sysread on the underlying socket.

      It's not that simple. An "unproductive" sysread from the application's point of view is actually productive from the SSL layer's point of view: data is copied from the socket buffer to the internal SSL buffer. So each "unproductive" sysread call actually makes the later "productive" sysread call lighter, because part of the SSL frame is already in the internal SSL buffer. The only way to measure the true overhead is to compare a productive sysread call with no preceding unproductive calls against a productive sysread call with N preceding unproductive calls.
      But anyway, I wouldn't say "unproductive" SSL sysreads have an overhead compared to "productive" ones; I would say any SSL sysread (even the "unproductive" ones, which don't decrypt anything yet, and that is the main problem in my opinion) has an overhead compared to a non-SSL sysread. That would explain why, for large data transfers over SSL, it's important to avoid superfluous sysread calls and not try to read before an entire SSL frame is available (or at least not after every single TCP packet received).


      If you call ->sysread as a method instead of using the tied-IO global function, is there any performance difference?

      No; in both cases I get about 40,000 sysread calls per second, and a CPU core is used at 100% instead of 55%.


      Have you verified that the code really does wait on select()? If fileno() didn't work as expected and the select() always returned immediately, it would result in really high CPU usage with no other hints that something was wrong.

      Yes, I checked that already, by adding a sleep in the send code on the server side and checking the resulting number of sysread calls on the client side. Everything works as expected. Also, the simple fact that using SO_RCVLOWAT instantly "fixes" the problem without touching the select code is a good indicator that the select call is working as expected.

        Well, you seem to have explored most of those pretty well already. My point about timing the productive vs. unproductive SSL reads, though, was to get a feel for the efficiency of the internals of the Perl SSL implementation without (yet) diving into the code.

        In the best case, the implementation of IO::Socket::SSL's sysread would look something like this:

        sub sysread(...) {
          if ($self->{decoded_remaining}) {
            return ...
          }
          elsif (sysread($self->{underlying_socket}, $self->{buffer}, ...) > 0) {
            $self->{next_frame_size} //= decode_frame_size($self->{buffer});
            if (length $self->{buffer} >= $self->{next_frame_size}) {
              ... # decode it
              return ...
            }
            $SSL_ERROR = ...;
            return undef;
          }
        }

        and that should run reasonably efficiently if you call it a bunch of times and the length() test is false. In fact, it shouldn't be too much worse than the same thing written in C, because there's just not much happening aside from a syscall and a few tests; I'd expect the syscall to be more than 75% of the execution time of that example code. Of course, long-lived code is seldom this simple and could have all sorts of extra checks going on that make it far more expensive than just the syscall. It might be decoding the frame size on every call instead of caching it, doing lots of other unnecessary work, or even making multiple syscalls for some reason. Have you tried systrace on it to see the actual calls? If you see unexpected things, then it's time to dive into the internals of SSL and see if there is a way to avoid triggering that. You could end up reporting a bug that helps everyone.

        On the other hand, just executing 40,000 syscalls, even in a compiled language, is expensive and could possibly peg a CPU. Comparing the calls that return data (which execute the decode step) with the calls that don't could give a hint about that.

        If perl is not adding excessive overhead, then you could be up against the limits of nonblocking reads to the TCP stack. You want there to be fewer sysreads, but the TCP stack is returning smaller fragments of data than are useful to process, and waking your program too often. In this case, you could implement your own variant of Nagle's algorithm, and just delay a number of milliseconds after the socket is readable before performing the read, thus collecting more data per read for better throughput, and not hurting the latency too much since you are doing bulk transfers and not tiny two-way signaling. Of course, this assumption only works when you know the sender is writing large packets; otherwise you'd be delaying the processing of small packets. I'd suggest making the delay dynamic, in response to the number of unproductive reads. This way, the code can auto-tune itself to systems with vastly different network or processor speeds.
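
        For illustration, a rough sketch of that dynamic delay, grafted onto the read loop of the client script from the original post (it reuses that script's $sockVec, $clientSock, BUFSIZE and $SSL_ERROR; handle_data() is a placeholder and the tunables are guesses, not measured values):

        use Time::HiRes qw(sleep);                 # fractional sleep
        use constant { MAX_DELAY => 0.005, DELAY_STEP => 0.0005 };

        my $delay = 0;                             # current extra wait, in seconds
        while(1) {
          select(my $readyVec=$sockVec,undef,undef,undef) > 0
              or die "Error in select for read: $!";
          sleep($delay) if $delay;                 # let more TCP segments accumulate
          my $n = sysread($clientSock, my $buf, BUFSIZE);
          if(defined $n) {
            die "Connection closed by peer" unless $n;
            handle_data($buf);                     # placeholder for the application logic
            # productive read: back the delay off a little
            $delay -= DELAY_STEP if $delay >= DELAY_STEP;
          }elsif($SSL_ERROR == SSL_WANT_READ) {
            # unproductive read: wait a bit longer before the next attempt
            $delay += DELAY_STEP if $delay < MAX_DELAY;
          }else{
            die "Failed to read from SSL socket: $!, $SSL_ERROR";
          }
        }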

Re: IO::Socket::SSL / Net::SSLeay inefficient in non-blocking mode ?
by Yaribz (Beadle) on Jun 23, 2022 at 22:01 UTC

    Actually, after thinking a bit more about it, I think I understand what is happening.

    I thought that SSL_WANT_READ could happen only when SSL-specific transport data (TLS handshakes, TLS session tickets, etc.) were in the socket buffer, which doesn't happen very often compared to real payload data (at least during large transfers). But I totally forgot that it also happens each time an incomplete SSL frame is in the socket buffer, which can basically happen for every single TCP packet received, as long as not enough packets have been received to complete the current SSL frame...

    So, there can be at most MAX_SSL_FRAME_SIZE / TCP_FRAME_SIZE failed sysread calls preceding each successful sysread call (if select/sysread is called between each received TCP packet).
    In my case, TCP_FRAME_SIZE = MSS (1460) because I'm performing large transfers.
    Given that MAX_SSL_FRAME_SIZE is 16384, it means that in the worst case, I can have 16384/1460 = ~11.22 => 11 failed sysread calls (empty) preceding one successful sysread of 16384 bytes.
    Worst case read failure rate = 11/12 = ~91.67%, which is indeed not far from the worst case I observed.

    Of course, it all depends on the rate at which the application tries to read/empty the socket buffer. Ideally, in the SSL case, I guess the application should not try to read anything before at least one complete SSL frame is in the buffer, to avoid wasting I/Os.

    I wanted to confirm this, so I tried the SO_RCVLOWAT (socket receive low-water mark) socket option to tell the kernel that the select call should not succeed on the SSL socket until at least 16384 bytes are ready to be read. This only required adding the following three lines to my non-blocking SSL client:

    use Socket qw'SOL_SOCKET SO_RCVLOWAT';
    ...
    setsockopt($clientSock,SOL_SOCKET,SO_RCVLOWAT,BUFSIZE)
        or die "Failed to set SO_RCVLOWAT on socket: $!";

    And as expected, the results are much, much better (a very low read failure rate due to SSL_WANT_READ):

    $ perl sslclinb.pl
    Connecting to SSL server (192.168.1.10:1234)
    Connected, switching to non-blocking mode.
    Downloading data from server...
    Transfer speed: 117.54 MB/s
     Read failure due to SSL_WANT_READ: 0.33%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.31%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.55%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.51%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.43%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.33%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 116.13 MB/s
     Read failure due to SSL_WANT_READ: 0.34%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.10%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.40%
     Read failure due to SSL_WANT_WRITE: 0.00%
    Transfer speed: 117.61 MB/s
     Read failure due to SSL_WANT_READ: 0.47%
     Read failure due to SSL_WANT_WRITE: 0.00%
    ^C

    Also as expected, the CPU usage has decreased a lot and is now similar to that of the implementation based on a blocking socket.

    Unfortunately, the SO_RCVLOWAT socket option is not available on all systems; for example, Windows doesn't support it. So it would be nice if there were another solution to avoid overloading the CPU and wasting I/Os when using non-blocking sockets with SSL...
