Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I have a basic networking/IPC question.

Here's my imaginary setup:

When the server receives a request for the price of bananas, it needs to perform a two-stage query. First, it needs to ask each database node, "How many bananas do you have?". After it hears back from all the machines, it tallies up the total number of bananas and prepares a second query: "There are $count bananas available today. What do each of you want to charge for bananas?" Once it hears back from all the back-end machines, the server then averages the prices and serves that number.

How should the server and the back-end machines communicate with each other?

I'm not really looking for an answer about how the problem could be solved with better database design -- in real life, the problem isn't exactly how I've described it. :) What I need is a basic understanding about what my options are for rapid communication between several machines on an internal network.

TIA!!!

Re: Rapid inter-machine communication on internal network
by shmem (Chancellor) on Oct 30, 2006 at 15:45 UTC
    To your server, the database nodes appear as servers as well. You can speed things up if you use:
    1. persistent connections
    2. select on the socket file handles to read responses as soon as they arrive
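    A minimal, self-contained sketch of those two points. Forked children stand in for the database nodes so it runs anywhere; the query string and banana counts are made up, and in real life each entry in @socks would be a persistent IO::Socket::INET connection to a remote machine:

```perl
use strict;
use warnings;
use Socket;
use IO::Handle;
use IO::Select;

# Three forked children stand in for the database nodes.
my @socks;
for my $count (5, 7, 3) {    # each pretend node holds this many bananas
    socketpair(my $parent, my $child, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    my $pid = fork() // die "fork: $!";
    if ($pid == 0) {                   # child: a pretend database node
        close $parent;
        my $question = <$child>;       # wait for the question...
        print {$child} "$count\n";     # ...and answer it
        close $child;
        exit 0;
    }
    close $child;
    $parent->autoflush(1);
    push @socks, $parent;    # point 1: this connection persists across queries
}

# Point 2: select() (via IO::Select) reads each answer as soon as it
# arrives, in whatever order the nodes happen to finish.
print {$_} "how many bananas?\n" for @socks;

my $sel   = IO::Select->new(@socks);
my $total = 0;
while ($sel->count) {
    for my $sock ($sel->can_read) {    # blocks until at least one node replies
        chomp(my $line = <$sock>);
        $total += $line;
        $sel->remove($sock);
    }
}
print "total: $total\n";               # prints "total: 15"
wait for @socks;                       # reap the children
```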

    You might want to have a look at POE.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      When you say "persistent connections", I think DBI. I know both from experimentation and documentation that establishing a connection to a database is expensive. DBI isn't involved in this project though (despite the database metaphor), so I guess what I need to know is how to do what DBI is doing.

      Forgive my ignorance, but just what does "persistent connection" mean in this context? How do I get one?

      You might want to have a look at POE.

      Thanks for the tip. It looks daunting.

      I guess I need to bone up on the fundamentals of sockets programming first. Judging from the responses so far, they're the answer. Since I've never had to solve this problem before, I wasn't sure if that was the right place to look, or if there was something else I should be checking out too.

        "Persistent connection" in this context means that the communicating sockets on each side are set up and show up as "ESTABLISHED" in the network connections table: the server has done an accept() and just sits there, ready to read from the socket whatever the client may tell it.

        Persistent also means that there's no shutdown of the connection after some talk or after some idle time, so whenever a request comes in to the dispatching server, it can query the database nodes without having to set up the connection anew, with all the overhead of the socket allocation, bind, listen, and accept calls, the SYN, SYN-ACK, ACK handshake, and so on. Pretty much like a "leased line" vs. a "dialup connection".
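        A self-contained sketch of the difference: the dispatching server connects once and reuses the same ESTABLISHED socket for every query. A forked child plays the database node on the loopback interface, and the "42" answer is made up:

```perl
use strict;
use warnings;
use IO::Socket::INET;

# A forked child plays the database node so the example is self-contained.
my $listener = IO::Socket::INET->new(
    Listen    => 1,
    LocalAddr => '127.0.0.1',
    LocalPort => 0,                    # let the OS pick a free port
) or die "listen: $!";
my $port = $listener->sockport;

my $pid = fork() // die "fork: $!";
if ($pid == 0) {                       # the node: accept once, answer forever
    my $conn = $listener->accept or exit 1;
    while (my $req = <$conn>) {        # connection stays ESTABLISHED
        print {$conn} "42\n";          # the same socket serves every request
    }
    exit 0;
}
close $listener;

# The dispatching server connects ONCE...
my $sock = IO::Socket::INET->new(
    Proto    => 'tcp',
    PeerAddr => "127.0.0.1:$port",
) or die "connect: $!";

# ...and reuses that socket for every query: no socket/bind/listen/accept
# overhead or TCP handshake cost after the first request.
my @answers;
for (1 .. 3) {
    print {$sock} "how many bananas?\n";
    chomp(my $line = <$sock>);
    push @answers, $line;
}
close $sock;                           # only now is the connection torn down
waitpid($pid, 0);
print "@answers\n";                    # prints "42 42 42"
```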

        --shmem

Re: Rapid inter-machine communication on internal network
by wjw (Priest) on Oct 30, 2006 at 14:41 UTC
    Neat abstraction/description of your challenge! Based on what you have described, I would wrap my initial (how many bananas) queries individually and use a dispatch routine to keep track of when I get all answers back, maintaining an open session to each of my back-end servers. Then do the same with my second query on the still-open sessions. Once the answers are received, close the sessions -- unless you're asking the questions every few seconds...

    There are a lot of open questions here though:

    1. why the need for speed? Because there are many such queries, or because the number of bananas on each machine or their price changes rapidly?
    2. how important is the time between answers to the query from machine to machine?
    3. is a held-open session detrimental?

    Hope I didn't completely miss the point.

    ...the majority is always wrong, and always the last to know about it...

      why the need for speed? Because there are many such queries?

      Split-second response time is of great value to the end user in this case. They really, really need to know the price of bananas right now. And yes, there are a lot of queries. Everybody loves bananas!

      how important is the time between answers to the query from machine to machine?

      For some commodities, it's easy to calculate a price and the worker nodes will finish their calculations quickly. For others it's tougher, and the worker nodes will have to churn for a while.

      When the worker nodes can finish their tasks quickly, it's important that the inter-machine communication time doesn't become a bottleneck that degrades the apparent response time from the user's perspective.

      Also note that because we need to know the total number of bananas across all nodes before any node can start calculating a price, we have a situation where the worker nodes as a group are only as fast as their slowest member. The same situation comes into play when calculating the final price to serve to the user -- we have to know what all the nodes want to charge before we can determine the final price. This is especially important if only one node has bananas.

      So we need a strategy that has good worst-case-scenario performance.

      is a held-open session detrimental?

      I don't know. I've never done any sockets programming.

      Probably it's better. When would it be bad?

        Ok, so you have a large number of repetitive queries that need to produce fast results on demand. Interesting that you stated "apparent" response time. That would imply that some fudging could take place... but I won't go there.

        My experience with this is in using Perl's DBI and MySQL. Because the answer you are passing back from the DB servers is basically a single value, bandwidth should not be a problem. Holding the connection open (persistent) reduces time lag a great deal. The only time you would not do that is if you are pushing the number of allowed connections to your DB server(s). Hmmm...

        You have busy DB servers, so I would try to hold persistent connections for the least amount of time. Maybe just make sure that you close each connection as soon as you get the second answer back from each server, so that you are not hogging connections.

        Perceived time delay is going to be dependent on the database answer, not on the network connection (IMO). You might also keep your fast query answers on the "answer" server (the one that calcs the averages) for as long as there is an open connection to your slowest server, not allowing any further connections to the fast servers until the slowest one responds; do the calcs, then close that last connection. That way you don't end up with a bunch of meaningless queries clogging up the network.

        Hope this makes sense :-)


Re: Rapid inter-machine communication on internal network
by cdarke (Prior) on Oct 30, 2006 at 15:02 UTC
    "Rapid" is relative. Using sockets is an obvious mechanism, and datagram (UDP) sockets are the fastest, but stream (TCP) sockets are generally easier to synchronize. Which mechanism to use depends on other factors as well, like volumes and security.
    You should also consider what happens when two requests for a banana come in when there is only one left. Each client will be told there is one banana, yet one must fail. Maybe transaction handling is what you mean by database design.
Re: Rapid inter-machine communication on internal network
by jmuhlich (Acolyte) on Oct 31, 2006 at 00:36 UTC

    In general this is referred to as RPC -- Remote Procedure Call. RPC is a generic term for an inter-machine request/response mechanism. Clients make structured requests to a server over a network connection, then the server does some processing and returns a structured response. There are some RPC modules on CPAN, but they either wrap an existing C library and are thus unperlish and hard to use, or are too slow for serious load.

    One project I've been working on needed a nice RPC library to do the exact sort of thing you're talking about (communication between a bunch of back-end servers collaborating to provide a publicly-visible service), so we created RPC::Lite to solve the problems I mentioned above. We've focused on designing something easy to use from Perl with a simple interface, while keeping performance in mind. For example, method signatures are completely optional just like in Perl itself, but perhaps a future version of the code will be able to go faster if you provide them. The client-side library has support for calling server methods asynchronously, which can help your front end remain responsive while waiting for answers from the back ends. The server-side library has support for threading, so a long-running calculation won't prevent other responses from being handled by that server.

    We released a preliminary version of RPC::Lite to CPAN only last week. The version is 0.10 but the basic functionality is there and it's working great in our project. We'd love for people to grab it and see how they like it.

Re: Rapid inter-machine communication on internal network
by SheridanCat (Pilgrim) on Oct 31, 2006 at 03:55 UTC
    I haven't ever tried anything like this, but perhaps this might work.

    Could you put a middle layer in that functions as an aggregation point for all the count and price data? Then, have the backend servers (holding banana counts and prices) contact the middle tier whenever their data changes?

    The front end would then just make a single call to the middle layer. From the end user perspective, it should be faster than hitting each backend server individually. And, since the backend servers update their count/price only when necessary, you can keep the total transactions down.

      Could you put a middle layer in that functions as an aggregation point for all the count and price data?

      No, that is impossible. The communication absolutely must proceed in two stages.

      In real life, we're not just talking bananas. We're talking bananas and thousands or millions of other items, and minute-by-minute changes. The cost to update a centralized repository for information about all possible items is prohibitive. If an aggregation layer were feasible, I would implement it. I agree that it would be a good approach.

      In the absence of such a solution, streamlining the communication between the boss node and the worker nodes becomes doubly crucial.

        Gotcha. I'm also having this nagging notion that concurrency issues are going to arise as you are updating/reading data.
Re: Rapid inter-machine communication on internal network
by sfink (Deacon) on Oct 31, 2006 at 06:55 UTC
    I wrote one of those once. It was no different than what other people have been saying. Sketching it out, it looks like:
    our @servers;
    our $server_fdset = '';
    # Fill in the IP addresses/names of all of the servers

    sub setup {
        for my $server (@servers) {
            $server->{socket} = IO::Socket::INET->new(
                Proto    => 'tcp',
                PeerAddr => "$server->{host}:$server->{port}",
            ) or die "connection to server $server->{name}\n";
            # In reality, you should do something smarter if a server fails --
            # return an incomplete result, or have redundant servers
            vec($server_fdset, fileno($server->{socket}), 1) = 1;
        }
    }

    sub query {
        my ($what) = @_;
        for my $server (@servers) {
            $server->{socket}->syswrite("how many $what you got, dude");
        }
        my $num_bananas = 0;
        my $done = 0;
        my $fds = $server_fdset;
        while ($done < @servers) {
            my $rfds;
            select($rfds = $fds, undef, undef, undef);
            for my $server (@servers) {
                if (vec($rfds, fileno($server->{socket}), 1)) {
                    my $buf;
                    $server->{socket}->sysread($buf, 4);
                    my $result = unpack("L", $buf);
                    $num_bananas += $result;
                    # Done with this server
                    $done++;
                    vec($fds, fileno($server->{socket}), 1) = 0;
                }
            }
        }
        # Repeat, using the new request
    }
    Obviously, that's incomplete and totally untested. And you shouldn't really be calling sysread or syswrite without checking the return value. You might want to use something else for reading and/or writing so you don't have to deal with short reads/writes. And, of course, if a server goes down, you probably don't want to hang forever waiting for it to return a response (especially if you got a zero-length read from it that tells you it shut down.) You might also want to use IO::Select instead of raw select, and you might want to store your servers in a hash or array keyed by the file descriptor, etc.
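    A hedged sketch of those hardening steps (the sub name and messages are made up): IO::Select instead of raw select(), a timeout so a dead server can't hang the query forever, and checked sysread return values, including the zero-length read that signals a shutdown:

```perl
use strict;
use warnings;
use IO::Select;

# $sel is an IO::Select already holding the connected server sockets.
# Returns the tally, possibly incomplete if servers died or timed out.
sub collect_counts {
    my ($sel, $timeout) = @_;
    my $num_bananas = 0;

    while ($sel->count) {
        my @ready = $sel->can_read($timeout);
        if (!@ready) {
            warn "timed out; ", $sel->count, " server(s) never answered\n";
            last;                      # return an incomplete result
        }
        for my $sock (@ready) {
            my $buf;
            my $n = sysread($sock, $buf, 4);
            if (!defined $n) {         # read error
                warn "read error: $!\n";
            }
            elsif ($n == 0) {          # zero-length read: server shut down
                warn "a server closed its connection\n";
            }
            else {
                # (a short read, 0 < $n < 4, would still need buffering
                # in real code)
                $num_bananas += unpack("L", $buf);
            }
            $sel->remove($sock);       # one answer (or failure) per server
        }
    }
    return $num_bananas;
}
```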

    Another thing I did was to duplicate each server. So every machine was responsible for two different partitions of the data. It helped for failover, but I also got (perhaps excessively) clever: I would first send out a query for the "main" partition for each server. Then, if server A (handling partitions 1 and 6) returns a result for partition 1 before server B (handling partitions 6 and 3) returns its result for partition 6, I'd send the query to server A for partition 6. It tangles up the logic a bit, because you have to keep track of state for each server. But if your response time variance is high, you can speed things up quite a bit. (You could obviously send both requests to each server initially, but that tends to bog down the server farm. Depends on what your load looks like -- it trades capacity for latency.)

    I had some even wackier stuff, so I ended up abstracting it out some: each server structure had a code ref storing its next action, and a slot for storing a description of what it should wait for before executing that next action. (The exact pattern of queries was dependent on the results of earlier queries, sometimes just from the same server, sometimes from others.)

    Finally, it's nice to keep a timestamp of the transmission of every request and the response's arrival. I wrote a little app that took all of those timestamps and graphed them so you could immediately see where the "long pole" was, and if the pattern looked more or less correct. It was useful in debugging as well as optimization. And it looked cool, especially since it picked a different color for every request..response line.

    Today, I'd probably take the plunge and learn POE.

      I feel kind of lame promoting my own module, but this is what that would look like with RPC::Lite, following the same sort of structure. It's not too much shorter, but it's far more concise and less prone to errors from using the lower-level calls yourself. This example would get a few lines shorter with some of the improvements I have in mind for the API (client construction and response pumping stuff). Note that this code is tested and working.
      use RPC::Lite::Client;

      our @clients;
      our @server_info;

      sub setup {
          foreach my $info (@server_info) {
              push @clients, RPC::Lite::Client->new({
                  Transport => "TCP:Host=$info->{host},Port=$info->{port}",
                  # default serializer is JSON, and XML is also available
              });
          }
      }

      sub query {
          my ($what) = @_;
          my ($responses, $result);

          # anon sub will be called with any responses from the server
          foreach my $client (@clients) {
              $client->AsyncRequest(
                  sub { $responses++; $result += $_[0] },
                  'getCount',
                  $what,
              );
          }

          # The API should support aggregating multiple clients so you can
          # replace the inner foreach with one call, but it doesn't yet.
          while ($responses < @clients) {
              foreach my $client (@clients) {
                  $client->HandleResponse;    # pump server for data
              }
          }

          return $result;
      }
        I feel kind of lame promoting my own module, but this is what that would look like with...
        I should probably post this in Meditations rather than burying it here -- but please don't feel lame about something like this (pointing out that you've written a module to solve a problem under discussion). It's one of the primary ways I find out about new modules, or find out what the intent is behind one that I already knew about.

        I'm certainly not going to blame you for helpfully pointing out that you have written some code and released it for me to freely use -- code that happens to implement exactly what I want (or something similar enough that you think it's what I want!)

      Oh and regarding POE, I definitely want to add POE support on both the client and server side. Also I have plans for mod_perl 2 support on the server side.
Re: Rapid inter-machine communication on internal network
by halley (Prior) on Oct 31, 2006 at 14:05 UTC
    If the server can be trusted to interpolate on a market schedule in many cases, ask the clients both questions at once.

    "How many bananas do YOU have NOW, and while you're at it, what's your schedule of pricing assuming a market supply of X, Y, or Z bananas? If it's not a simple interpolatable schedule, i.e., if you use complicated models when the market isn't exactly X, Y, or Z, let me know that too and I'll have to get back to you."

    If it's nearly always a complicated model or the affine schedule is really huge, then this approach is less valuable.

    --
    [ e d @ h a l l e y . c c ]

Re: Rapid inter-machine communication on internal network
by pajout (Curate) on Oct 31, 2006 at 15:39 UTC
    Can you quantify the problem? For instance:
    Number of nodes
    Average and maximal response times (separately for counts and prices)
    Frequency and amount of data changing on the nodes
    Number of commodities
    ...
Re: Rapid inter-machine communication on internal network
by Rhandom (Curate) on Oct 31, 2006 at 17:36 UTC
    How does the number of bananas get updated on the nodes? Is there any calculation going on on the nodes - or just a summing of the bananas? How often do the numbers change? Do you know the number of bananas at the time the number changes?

    You may be lucky enough to be able to write out a static HTML page containing the number of bananas. Nothing will be able to query as fast as this.

    But even if you do have to use a cgi script or mod_perl (I'd heavily suggest mod_perl), the following algorithm will work fine - and there is no threading or forking.

    my @socks;
    for my $host (@hosts) {
        my $sock = IO::Socket::INET->new(PeerAddr => $host, PeerPort => 80);
        die "$host failed: $!" if ! $sock;
        print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
        push @socks, $sock;
    }

    my $sum = 0;
    for my $sock (@socks) {
        local $/;                  # slurp the whole response
        my $resp = <$sock>;
        $sum += parse_resp($resp);
    }


    This code works and is as fast as your slowest server. You have asked all of the servers the question, which takes near-zero time (assuming connection time isn't the issue). Each of the servers can take however long it needs to generate the response. Then you begin asking for the answers. As long as transport isn't an issue (on a local network it had better not be), you have effectively asked all of the servers for the answer and gotten the results back in about the time it took the slowest server to generate its response. Servers that finish more quickly simply have the result waiting for you in the open $sock handles. Ones that take longer will block while you read $sock -- but that is OK, because the other servers will still be creating their responses while you wait for the slow one.

    You can try things more complicated - but chances are very, VERY high that the extra complication will introduce its own overhead and bottlenecks and won't deliver the results any faster. (If connection time or transport are issues then all of this goes out the window - but it is doubtful connection or transport will be the bottleneck for this application).

    my @a=qw(random brilliant braindead); print $a[rand(@a)];
      How does the number of bananas get updated on the nodes? Is there any calculation going on on the nodes - or just a summing of the bananas? How often do the numbers change? Do you know the number of bananas at the time the number changes?

      The cost for a worker node to figure out how many bananas it has is next to nothing. So the nodes will be able to answer back on the first loop right away.

      The cost for a worker node to figure out how much it should charge for a banana varies. It can be small; it can be quite substantial. The calculation is very complex.

      Updates are potentially very frequent, and we have to plan as if they are occurring all the time.

      This code works and is as fast as your slowest server because: You have asked all of the servers the question which takes near 0 time (assuming connection time isn't the issue).

      This is cool, though the web-page thing won't work out. However, these apps are all persistent, and from what I gather we can establish enduring socket connections between them. So we can iterate over an array of existing connections instead of hostnames. Your point about blocking still applies. The amount of data transmitted between boss and worker nodes is small, so transmission time won't be an issue unless some unknown factor gets in the way and interferes with a connection. From what I read in this thread, I gather no such problem is likely to arise at the level of the internal network when using sockets (assuming that we're nowhere near network capacity).
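      In code, a hypothetical sketch of that adaptation (the sub name and query line are made up): iterate over an array of already-connected sockets, opened once at startup, instead of reconnecting to @hosts on every query, keeping Rhandom's ask-everyone-then-collect structure:

```perl
use strict;
use warnings;

# @socks holds persistent connections established once at startup.
sub total_bananas {
    my @socks = @_;

    # Ask everyone first -- near-zero cost, so all nodes start working at once.
    print {$_} "how many bananas?\n" for @socks;

    # Then collect. A slow node blocks this loop, but the fast nodes'
    # answers are already waiting in their socket buffers, so the whole
    # pass takes about as long as the slowest node.
    my $sum = 0;
    for my $sock (@socks) {
        chomp(my $resp = <$sock>);
        $sum += $resp;
    }
    return $sum;
}
```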

      We still need to handle the case of individual worker nodes malfunctioning, but since the first loop has a tiny cost, it can also serve as verification that everybody's present and accounted for.