Marcello has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm looking for a good way to solve the following problem:

Clients connect to a machine, log in and wait for messages. These messages (basically strings) are stored in a database with the appropriate client id. The main Perl process forks a child for each client connection, so each child process has to check the database for messages (e.g. 5 times per second).

When there are many clients connected, this causes a heavy load on the database. I am looking for a good solution for this.

The best I can come up with is this:
The main process which accepts clients forks another process which selects the unique client ids from the message database. This process signals the corresponding client process that there are messages waiting. After this, the corresponding process will also query the database to get the waiting messages.

This way, only one process actively queries the database. The difficulty in this approach is the process setup and fork management: the main process knows its children and thus which client each one serves, but the database poller process does not.
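
A rough sketch of the poller I have in mind (simplified; the table and column names are just examples, and %pid_for stands for the client-to-pid mapping that I don't yet know how to share between the processes):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Time::HiRes qw(sleep);

    # %pid_for maps client ids to the pids of the child processes
    # serving them -- sharing this mapping is exactly the hard part.
    my %pid_for;

    my $dbh = DBI->connect('dbi:mysql:messages', 'user', 'pass',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT DISTINCT client_id FROM message');

    while (1) {
        $sth->execute;
        while ( my ($client_id) = $sth->fetchrow_array ) {
            # tell the child serving this client to fetch its messages
            kill 'USR1', $pid_for{$client_id} if $pid_for{$client_id};
        }
        sleep 0.2;    # poll 5 times per second
    }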

Any ideas/suggestions/improvements?

Marcello

Re: Database poller
by Zaxo (Archbishop) on Jun 15, 2003 at 01:00 UTC

    If your rdbms supports it, stored procedures would speed things up. A trigger on insertion would let you avoid polling.

    Do you remove messages as they are read? If not, a timestamp or 'seen' flag would let you reduce the size of responses.
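
    For example (just a sketch -- the 'message' table and 'seen' column are invented names, and $dbh is assumed to be an already connected DBI handle):

        # mark messages as seen instead of shrinking the table on every read
        my $fetch = $dbh->prepare(
            'SELECT id, body FROM message WHERE client_id = ? AND seen = 0');
        my $mark  = $dbh->prepare('UPDATE message SET seen = 1 WHERE id = ?');

        $fetch->execute($client_id);
        while ( my ($id, $body) = $fetch->fetchrow_array ) {
            deliver_to_client($body);   # hypothetical delivery routine
            $mark->execute($id);
        }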

    After Compline,
    Zaxo

      Hi Zaxo,

      I'm using MySQL, so stored procedures and triggers are unfortunately not yet supported.

      I am deleting records when they are processed, so the database table is nearly always empty.

      Maybe another intermediate database table could help? I was thinking of a table where the main process (which accepts the clients) stores the currently active client processes including their process id. So if the database process finds a message, it checks the process table to see if the client to which the message should be delivered is currently logged in. All it has to do after that is signal the process that there are messages waiting.
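
      Something along these lines, perhaps (sketch only; the client_process table and column names are made up, and $dbh is assumed to be connected already):

          # main process, right after forking a child for a client:
          $dbh->do('INSERT INTO client_process (client_id, pid) VALUES (?, ?)',
                   undef, $client_id, $child_pid);

          # poller process, for each client id that has messages waiting:
          my ($pid) = $dbh->selectrow_array(
              'SELECT pid FROM client_process WHERE client_id = ?',
              undef, $client_id);
          kill 'USR1', $pid if $pid;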

      Maybe this is a good approach?

      Regards,
      Marcel
Re: Database poller
by mobiGeek (Beadle) on Jun 15, 2003 at 15:07 UTC
    The question is: where is the bottleneck?

    There is no point trying to "throw" optimizations at the problem if you don't know where the problem is. Changing your forking model or your schema or your queries won't help if that isn't where the problem lies.

    So, my suggestion is to profile your code; find out where in your process large amounts of time are being taken. Then fix those areas or post back with the specifics looking for particular optimizations.

    It may seem like a lot of work (possibly not-fun work) trying to determine the bottlenecks. But it sure beats reworking a whack of code only to find that the particular problem is not solved.
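
    For a quick first pass, the stock Devel::DProf profiler is usually enough (just an illustration -- substitute your own script name, and note that a long-running forking daemon needs a clean exit before the profile is written):

        perl -d:DProf yourserver.pl    # run under the profiler, writes tmon.out
        dprofpp tmon.out               # report where the time went, by subroutine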

Re: Database poller
by EvdB (Deacon) on Jun 15, 2003 at 15:06 UTC
    Not an answer to your question but:

    You could look into getting rid of the database and instead setting up something similar to a mail folder, say Maildir. As this was designed for email, which shares many of the traits of your problem, it should work just fine. Since many of the files will be cached in RAM, there should be no real performance penalty from disk access either.

    Maildir does not work on Win32 due to the filenames used, but something similar should be possible.

    You could have a separate folder for each client, gaining you the security mentioned above. Backups would also become very simple. Of course, if the daemons are on a different machine to the files then...
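
    Anyway, the writing side could be as simple as this (a sketch; the directory layout is just an example and assumes each client's tmp/ and new/ directories already exist):

        use File::Spec;
        use Time::HiRes qw(time);

        # Maildir-style delivery: write to tmp/, then rename into new/.
        # rename is atomic on the same filesystem, so readers never see
        # a half-written message.
        sub queue_message {
            my ($spool, $client, $body) = @_;
            my $name = sprintf '%.6f.%d', time(), $$;
            my $tmp  = File::Spec->catfile($spool, $client, 'tmp', $name);
            my $new  = File::Spec->catfile($spool, $client, 'new', $name);
            open my $fh, '>', $tmp or die "open $tmp: $!";
            print {$fh} $body;
            close $fh           or die "close $tmp: $!";
            rename $tmp, $new   or die "rename $tmp -> $new: $!";
        }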

    --tidiness is the memory loss of environmental mnemonics

Re: Database poller
by aquarium (Curate) on Jun 15, 2003 at 11:22 UTC
    If indeed your bottleneck is the table holding the messages (i.e. the get/delete queue gets long), then you could split up the table so that each client has its own table. This also avoids having to store the client id and check for it in every message. It will definitely relieve contention issues and improve the security of the data. Also, check that your message lengths don't exceed the maximum MySQL packet length and modify accordingly, and even zip the data en route if CPU load is low.
      It is in general a bad idea to create one (identical) table for each client (or one table for each day, etc.)

      Yes, this probably provides a short-term performance gain, but it will lead to serious scalability and administrative headaches, and is really not the way a relational database should be used.

      If the database server is the problem then you should see what can be done to tune the server to perform better (indexes, TCP options, etc). And is it really necessary to query the database for each client 5 times per second?

      Michael

        Hi mpeppler,

        Messages should be delivered to clients ASAP, so the database should be checked at least twice per second. The problem is not the database at the moment, but I want a scalable solution: if every client process polls the database this way, it will eventually become a performance issue. With a single process checking the database, many clients can be connected at once.

        Regards, Marcel
        I'm sorry, but you are wrong... it produces a long-term performance gain. If you have 100 clients, yes, you will need to custom-build some admin interface -- but you will also want one if you're using a single table and have a jumble of messages in it. Also, with one possibly large and much sought-after table, if any client misbehaves and slows things down, all clients will suffer. You also need to be very careful when administering messages for one client so as not to affect others and corrupt their data too.

        Scalability? If your program is modular enough, you can farm out client tables to separate databases on separate boxes. One other thing that can be done to improve performance is to have an "already read" field that you check for when retrieving messages, and only every now and then delete all those "already read" messages in one query (sketched below), so the RDBMS is not shuffling data every time you take a message off the table.

        The data contention issue is the most likely bottleneck you will encounter with your current design: as the number of clients grows, performance will drop very quickly with each new client added. Sometimes you do need to de-normalize your data to suit real-world applications. In fact, relational theory calls for de-normalization after first normalizing your data, if that is necessary. In this case, it is.
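
        The "already read" idea, roughly (a sketch; the column name is invented and $dbh is assumed to be an open DBI handle):

            # on read: flag the message instead of deleting it straight away
            $dbh->do('UPDATE message SET already_read = 1 WHERE id = ?',
                     undef, $id);

            # housekeeping, run every few minutes rather than per message:
            $dbh->do('DELETE FROM message WHERE already_read = 1');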