Let's work from your small library example with 1000 books and 50 servers. Let's say you have 500 users. Assume it's a small library of highly specialized volumes with a lot of check-out contention. A username and password isn't much data to need to scale, but the information about all the books a user has checked out and checked back in could be. There's the data about the books themselves too, of course. Then there's the query load from your 500 users. Let's say 100 users are hitting your database application at any given time. Each server ideally (not accounting for storage redundancy here) stores information about around 20 volumes and about 10 users.

Let's take your brute-force method first. If you query every server for every login, every check-out record, every check-in record, and every book that's part of those records just to build the history of the 5 books a user has checked out, you have 800 queries (50 for the login, plus 50 * 5 * 3 for the check-out, check-in, and book data entries). If 5% of your concurrent users (5 people) ask for their recent checkout history (averaging 5 books), you're dealing with 4,000 queries before even considering the other users. To find a specific book's record for its summary info, you're either hitting all 50 servers or doing a short-circuit linear search averaging 25 queries. You can't short-circuit the check-in and check-out queries mentioned above, since there can be multiples of those.
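To make the arithmetic above easy to check, here's a minimal sketch (Python purely for illustration; the figures are the ones from the example, not anything measured):

```python
# Query-count arithmetic for the brute-force case: every query
# goes to every server, so counts multiply by the server count.
SERVERS = 50              # servers in the example
BOOKS_PER_HISTORY = 5     # books in a user's recent history
RECORD_TYPES = 3          # check-out, check-in, and book data
CONCURRENT_REQUESTS = 5   # 5% of 100 concurrent users

login_queries = SERVERS                                 # every server asked for the login
record_queries = SERVERS * BOOKS_PER_HISTORY * RECORD_TYPES
per_user = login_queries + record_queries               # queries for one history request
total = per_user * CONCURRENT_REQUESTS                  # all 5 concurrent requests

print(per_user, total)  # 800 4000
```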

Now, let's say there's a very small table (or even just a configuration file for the application, but we'll assume you use the DB for it for ease of update) on every server which gives some hashing info. Assume for ease of hashing that users have an all-digit ID number of at least two digits for their username. Take the last two digits modulo 50, with 00 mapping to server 50: users with IDs ending in 42 or 92 have their login data on server 42, while users with IDs ending in 09 or 59 have theirs on server 9. The last two digits of the ISBN map onto the 50 servers just as well. So you now have one query for the user's login, one per book for its check-out/check-in records, and one per book for the book data, with just a case statement for overhead. With 5 users grabbing their checkout history at an average of 5 books each, your system handles 1 + 5 + 5 = 11 queries per user, or 55 total. That's a drop of about 98.6% in queries for that type of operation. To find a particular book's record, you issue 1 query. That's a drop of 96% on average (98% in the worst case) vs. querying the servers in order.
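A minimal sketch of one way to realize that suffix-to-server mapping (assuming the mod-50 interpretation, with servers numbered 1 through 50 and suffix 00 landing on server 50; the function name is hypothetical):

```python
def server_for(suffix_digits: str, servers: int = 50) -> int:
    """Map a two-digit ID/ISBN suffix (00-99) onto servers 1..50.

    The suffix taken mod 50 picks the server; a result of 0
    (suffixes 00 and 50) maps to server 50.
    """
    n = int(suffix_digits) % servers
    return n if n != 0 else servers

# Each pair of suffixes 50 apart lands on the same server:
print(server_for("42"), server_for("92"))  # 42 42
print(server_for("09"), server_for("59"))  # 9 9
print(server_for("00"))                    # 50
```

The application (or the small per-server table) consults this mapping and issues exactly one query to the right server, instead of broadcasting to all 50.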

Your additional servers are offering you additional storage, but there's more to scalability than storage space. They're not really helping the application scale on your network or in terms of queries per server.

For server load (as in queries processed per server), you're still hitting either every server or, on average, half of them for every query. You might as well be using just two servers, so long as they could keep up. By hashing the data, even with a simplistic method, you clear up this roadblock.

From the networking standpoint, you're actually multiplying traffic. The query size times the number of servers the query hits can become quite a large number. As far as network congestion is concerned, you'd actually be better off with fewer, bigger, faster servers with more storage. By hashing the data, again even simply, you can cut your traffic drastically.

There is one drawback to hashing your data, though: it's not as general as just dropping one module in place of another. You need to have some idea of what your data is going to look like in order to divide it across the servers with some level of balance using this method.
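A quick sketch of why the shape of the data matters, assuming the simple "last two digits mod 50" scheme from earlier (the `server_for` helper here is hypothetical):

```python
from collections import Counter

def server_for(suffix: int, servers: int = 50) -> int:
    """Last-two-digits mod 50, with 0 mapping to server 50."""
    n = suffix % servers
    return n if n else servers

# If suffixes are uniform over 00-99, every server gets exactly
# two suffixes -- perfect balance.
uniform = Counter(server_for(s) for s in range(100))
print(min(uniform.values()), max(uniform.values()))  # 2 2

# But if your real IDs are skewed -- say every ID happens to end
# in a multiple of 5 -- only 10 of the 50 servers get any data,
# and the rest sit idle.
skewed = Counter(server_for(s) for s in range(0, 100, 5))
print(len(skewed))  # 10
```

So before picking the hash key, you'd want to look at how your actual IDs and ISBNs are distributed.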


In reply to Re^7: RFC: OtoDB and rolling your own scalable datastore by mr_mischief
in thread RFC: OtoDB and rolling your own scalable datastore by arbingersys
