Re^6: RFC: OtoDB and rolling your own scalable datastore

You could work directly with fixed-width data files with fixed-width index files on a clustered file system.

This is true. Or you could use tuple storage, as chromatic pointed out. The thing I like about an RDBMS is that you get sorting and filtering for free, plus everything else that comes with this type of data system (mentioned above). OpenLDAP would likely be better for hierarchical data, but I would point out that OtoDB is flexible enough to handle hierarchies as well, without being limited to them.

If you're using relational databases, why are you querying servers in sequence to see which has the data?

Of your points, I like this one the best, because this is a problem I see with the design, and I'm still pondering it.

First, I'll say that my thought all along for reducing network traffic was to couple OtoDB with caching, i.e. memcached. Straightforward and powerful.

But it is inefficient to send a SQL command blindly to n servers, especially when the WHERE clause will only return n-y records (where 0 < y < n), since at least y of those servers have nothing to return. For queries that return n or more records, I don't see a huge problem. In probably a lot of cases, the number of records returned will easily exceed n, and with incremental inserts it's likely that data will exist on all servers for most queries.

In my examples above, where you have libraries and books, even a small library is likely to have 1000 books. It's doubtful that you'll have > 1000 servers, and if you did, you would probably have caching anyway.
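A rough way to check that intuition is to simulate where rows land. This is only a sketch: it assumes "incremental insert" means round-robin placement and that the rows matching a query were inserted back to back. The function names are hypothetical and not part of OtoDB.

```python
def servers_with_data(num_records: int, num_servers: int, first_server: int = 0) -> set:
    """Return the servers that hold at least one of num_records rows.

    Assumes round-robin placement: row i goes to server
    (first_server + i) % num_servers.
    """
    return {(first_server + i) % num_servers for i in range(num_records)}


def wasted_queries(num_records: int, num_servers: int) -> int:
    """Count the servers a blind fan-out query hits that hold no matching rows."""
    return num_servers - len(servers_with_data(num_records, num_servers))


if __name__ == "__main__":
    # A 1000-book library on 50 servers: every server has rows, nothing is wasted.
    print(wasted_queries(1000, 50))
    # A result set of only 5 rows on 50 servers: 45 queries come back empty.
    print(wasted_queries(5, 50))
```

Under these assumptions the fan-out only becomes wasteful when the result set is smaller than the number of servers.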

But, given the case where you have a user profile and 50 servers, login is highly inefficient because you have to look on every server until you find the user and check his password. However, it wouldn't be hard to extend OtoDB (or add logic to your app) to simply store, on a single server, the username/password and a pointer to the data unit where the profile is located, reducing your queries from 50 to 2. Update: Or, couple OtoDB with a standard RDBMS server for some subset of the data, e.g. user login info.
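The directory idea can be sketched with in-memory stand-ins for the servers. Everything here is hypothetical (the `Server` class, the `alice` user, the `"unit"` pointer field); it only counts lookups to show the 50-versus-2 difference in the worst case, and it is not how OtoDB itself works.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Stand-in for one database server; counts the lookups it receives."""
    rows: dict = field(default_factory=dict)
    queries: int = 0

    def lookup(self, username):
        self.queries += 1
        return self.rows.get(username)


def login_brute_force(units, username):
    """Ask each data unit in turn and stop at the first one that has the user."""
    for unit in units:
        profile = unit.lookup(username)
        if profile is not None:
            return profile
    return None


def login_with_directory(directory, units, username):
    """Query 1 hits the directory, query 2 hits the unit it points to."""
    entry = directory.lookup(username)
    if entry is None:
        return None
    return units[entry["unit"]].lookup(username)


def build_cluster(num_units=50):
    """Place one made-up user on the last unit, the worst case for a scan."""
    units = [Server() for _ in range(num_units)]
    units[-1].rows["alice"] = {"name": "Alice"}
    directory = Server(
        rows={"alice": {"password_hash": "<hash>", "unit": num_units - 1}}
    )
    return directory, units
```

The directory server becomes a single point that every login touches, which is the trade-off against scanning all the data units.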

But really, I just see this as caching, and I'm wondering if it should be part of OtoDB itself, or relegated to something that is already doing it, and would probably do it better. That being said, it still bothers me that in some cases querying each server is overkill. I'm still mulling, and your suggestions have definitely given me some more to think about.

As to adding servers to an existing set, this wouldn't automatically require rebalancing of data, but probably would in most cases. This is where using an RDBMS is helpful, because it wouldn't be terribly hard to create some backend processes that understand your data and know how to move some of it to the new server. OtoDB can't do this automatically, however.

A blog among millions.

Re^7: RFC: OtoDB and rolling your own scalable datastore by mr_mischief (Monsignor) on Jul 22, 2008 at 17:49 UTC
Let's work from your small library example with 1000 books and 50 servers. Let's say you have 500 users. Assume it's a small library of highly specialized volumes that have a lot of check-out contention.

A username and password isn't much data to need to scale, but the information about all the books the user has checked out and checked back in could be. There's of course the data about the books themselves, too. Then there's the load of queries for your 500 users. Let's say 100 users at any time are hitting your database application. Each server ideally (without accounting for storage redundancy here) stores information about around 20 volumes and about 10 users.

Let's assume your brute-force method first. If you query every server for every login, every check-out record, every check-in record, and every book that's part of the in-out records just to make a history of the 5 books a user has checked out, then you have 800 queries (50 for the login, 50 * 5 * 3 for the check-in, check-out, and book data entries). If you have 5% of your concurrent users (5 people) asking for their recent checkout history (averaging 5 books), you're dealing with 4,000 queries before even considering the other users.

To find a specific book's record to see the summary info about it, you're either hitting all 50 servers or you're doing a short-circuit linear search for an average of 25 queries. You can't short-circuit the check-in and check-out queries mentioned before, since there can be multiples of those.

Now, let's say that there's a very small table (or even just a configuration file for the application, but we'll assume you'll use the DB for it for ease of update) on every server which gives some hashing info. Let's assume for ease of hashing that users have an all-digit ID number of at least two digits as their username. Users with IDs ending in 42 and 84 have their login data on server 42, while users with ID numbers ending in 09 and 18 have their info on server 9. The last two digits of the ISBN map just as well to 50 servers -- either the digits are the server number or twice the server number (00 goes to server 50).

So, you now have one query for the user and one query for each book, with just a case statement for overhead. With 5 users grabbing their checkout history at an average of 5 books checked out, your system handles 1 + 5 + 5 queries each, or 55 total. That's a drop of 98.625% in queries for that type of operation (55 instead of 4,000). To find a particular book's record, you issue 1 query. That's a drop of 96 to 98% vs. querying the servers in order (1 query instead of an average of 25, or of all 50).

Your additional servers are offering you additional storage, but there's more to scalability than storage space. They're not really helping the application scale on your network or in terms of queries per server.

For server load (as in queries processed per server), you're still hitting every server in some cases and either all of them or half of them on average for every query. You might as well be using just two servers so long as they can keep up. By hashing the data, even using a simplistic method, you'd clear up this roadblock.

From the networking standpoint, you're actually multiplying traffic. The query size times the number of servers the query hits can become quite a large number. You'd actually be better off with fewer, bigger, faster servers with more storage as far as network congestion is concerned. By hashing the data, again even simply, you can cut your traffic drastically.

There is one drawback concerning hashing your data, though. It's not as general as just dropping the module in the place of another. You actually need to have some idea of what your data is going to be in order to divide it across servers with some level of balance with this method.
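The routing idea and the query counts above can be sketched as follows. Note the swap: this uses a plain modulo on the last two digits rather than the "digits or twice the server number" rule, because that rule is ambiguous for a value like 42 (server 42, or twice server 21). The modulo version agrees with the post for endings 42, 09 and 00, but sends 84 and 18 to servers 34 and 18. All function names are hypothetical.

```python
NUM_SERVERS = 50


def server_for(key: str, num_servers: int = NUM_SERVERS) -> int:
    """Map an all-digit key to a server in 1..num_servers by its last two digits.

    Endings 01..50 map to themselves, 51..99 wrap around, and 00 goes to
    the highest-numbered server.
    """
    last_two = int(key[-2:])
    return (last_two - 1) % num_servers + 1


def brute_force_history_queries(num_servers: int, books: int) -> int:
    """Login on every server, then check-out, check-in and book rows on every server."""
    return num_servers + num_servers * books * 3


def hashed_history_queries(books: int) -> int:
    """The post's 1 + 5 + 5 count: one user query plus two per-book terms."""
    return 1 + books + books


if __name__ == "__main__":
    users, books = 5, 5
    print(server_for("1042"))                                      # user ID ending in 42
    print(users * brute_force_history_queries(NUM_SERVERS, books))  # fan-out total
    print(users * hashed_history_queries(books))                   # routed total
```

With two-digit endings and 50 servers, every server receives exactly two of the hundred possible endings, so the split is even as long as the endings themselves are evenly distributed.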
Re^8: RFC: OtoDB and rolling your own scalable datastore by arbingersys (Pilgrim) on Jul 23, 2008 at 21:23 UTC
Thanks for your thoughts. I'm a little confused about the 800 queries you mention. Here's what I get in terms of finding check-in/check-out history for a user.

1. User logs in

We have to query 50 servers, definitely something you don't want. I'm beginning to think that if I were really building this application, I would have a standard RDBMS for some data, like logins, and OtoDB only for whatever high-volume, read-intensive data exists. (A library system is actually a poor example in retrospect; I should have used something that made more sense from a scalable Internet site perspective...)

2. User goes to checkout history page (now that we have his unique ID)

Here's how the user_history table might look:

user_id | check_out_date | check_in_date | book_title | book_id

We query 50 servers for user_id. (If check_in_date is empty, the book is still out.)

Total queries: 100 (50 for the login plus 50 for the history)

Redundant data is stored in the user_history table, i.e. full book info is also stored somewhere else, but it's been optimized for reading, and spread amongst the 50 servers.

The real problem that I see is the likelihood that a user will have checked out fewer than 50 books, so we're sending queries to servers that aren't going to have any data. (Of course, if this system existed, I doubt they'd start out with 50 data servers, or use anything other than an RDBMS for that matter.)

I know that network traffic, in terms of queries and network overhead for data returned, increases with the number of servers present. But as the user checks out more than 50 books, it's more and more likely that he has data on every server. So we send 50 query requests over the network to the servers, and get data returned on the order of 2 or fewer records per server, each served over a 100MB switch port. As opposed to one server returning 50 or more records over a single 100MB switch port.

I do like your ideas for hashing, but what you describe above seems more on the order of data sharding, which some sites have done successfully to handle growth, from what I've read. The problem you have then is rebalancing data as servers with high volume get loaded down.

In my example above, however, if I did use a hashing/whatever scheme to connect book_id to a particular server, then when the user clicks to read the full details, the system would know to go specifically to a single server for its next query.

A blog among millions.
Re^9: RFC: OtoDB and rolling your own scalable datastore by mr_mischief (Monsignor) on Jul 23, 2008 at 21:41 UTC
Well, your data denormalization was broader in scale than mine. That alone does cut down a good deal on the number of queries that are necessary. Your statements about the traffic are a bit off, though. One query returning 50 records from one switch port to one other switch port (your app frontend that needs the results) is a more efficient use of your network than 50 queries each returning one result as individual TCP streams back to your one port on the frontend. Remember that these are multiplied by your concurrent users, so 10 users making a 50-record query with one record per server means 50 * 10 queries and 50 * 10 responses, with all the overhead of each of those. You could have 50 overall queries to 50 servers and 50 overall responses, even though they are 50 times the size.
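To put rough numbers on the overhead argument, here is a small sketch comparing a 50-server fan-out with the case where each user's 50 records sit on one server. The `record_bytes` and `overhead_bytes` values are made up purely for illustration; real per-message costs depend on the protocol, the query text and the driver.

```python
def messages(users: int, servers_per_query: int) -> int:
    """Total requests plus responses: each server hit costs one of each."""
    return 2 * users * servers_per_query


def bytes_on_wire(users: int, servers_per_query: int, records_per_user: int,
                  record_bytes: int = 200, overhead_bytes: int = 100) -> int:
    """Payload is the same either way; per-message overhead scales with fan-out.

    record_bytes and overhead_bytes are illustrative assumptions, not
    measurements of any real system.
    """
    payload = users * records_per_user * record_bytes
    overhead = messages(users, servers_per_query) * overhead_bytes
    return payload + overhead


if __name__ == "__main__":
    # 10 users, 50 records each, spread one record per server.
    print(messages(10, 50), bytes_on_wire(10, 50, 50))
    # Same users and records, but each user's rows live on a single server.
    print(messages(10, 1), bytes_on_wire(10, 1, 50))
```

The payload term is identical in both cases, so any difference comes entirely from the number of messages.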