in reply to Re^2: RFC: OtoDB and rolling your own scalable datastore
in thread RFC: OtoDB and rolling your own scalable datastore

Thanks for your insights.

However, it is not the space requirements of the redundant data which worry me (with terabyte disks within most people's reach, that is no longer an issue), but rather keeping all that redundant data coordinated. That seems a big task, and one the RDBMS data model has already proven it can solve.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Re^4: RFC: OtoDB and rolling your own scalable datastore
by arbingersys (Pilgrim) on Jul 15, 2008 at 15:56 UTC

    Because the data used by OtoDB is effectively flat, you can stripe it across servers with very little effort. You just have to query that same set of servers to ensure that you get all of the data back.

    Inserts work incrementally: the library makes an insert, moves the pointer to the next server in the list, and the next insert is done there. Somewhere, you must track the current insert server. In the RubberWiki demo I simply write the current server index to a text file; you could also use a database table or a dedicated server.
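
    A minimal sketch of that pointer bookkeeping, using plain DBI; the DSNs, table name, and index-file name here are all made up for illustration:

        use strict;
        use warnings;
        use DBI;

        # Hypothetical data servers and insert-pointer file.
        my @dsn = map { "DBI:mysql:database=otodb;host=data$_" } 1 .. 3;
        my $index_file = 'current_server.idx';

        # Read the current insert pointer (default to 0 on first run).
        my $current = 0;
        if (open my $fh, '<', $index_file) {
            $current = <$fh> || 0;
            chomp $current;
            close $fh;
        }

        # Do the insert on the current server only.
        my $dbh = DBI->connect($dsn[$current], 'user', 'pass',
                               { RaiseError => 1 });
        $dbh->do('INSERT INTO units (k, v) VALUES (?, ?)',
                 undef, 'title', 'Moby Dick');
        $dbh->disconnect;

        # Advance the pointer round-robin and write it back.
        $current = ($current + 1) % @dsn;
        open my $out, '>', $index_file or die "Cannot write $index_file: $!";
        print {$out} "$current\n";
        close $out;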

    Queries, updates, and deletes are done against all servers in the list: instead of sending a SQL command to one RDBMS, you send it to n servers. (Again, it's only this easy because the data structure is simplified; fully relational data is much harder to distribute this way.) OtoDB does this sequentially, which I suspect will become a bottleneck at some point, but my intuition is that all three operations are parallelizable.
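
    The sequential fan-out itself is not much code. A rough sketch, again with invented DSNs and table names:

        use strict;
        use warnings;
        use DBI;

        my @dsn = map { "DBI:mysql:database=otodb;host=data$_" } 1 .. 3;
        my $sql = 'DELETE FROM units WHERE k = ? AND v = ?';

        # Same statement, every server, one after the other.
        for my $dsn (@dsn) {
            my $dbh = DBI->connect($dsn, 'user', 'pass', { RaiseError => 1 });
            $dbh->do($sql, undef, 'title', 'Moby Dick');
            $dbh->disconnect;
        }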

    Of these, querying is the only one that requires re-consolidation of the data. Update and delete just ask the data servers to perform some maintenance, and don't return anything to the user. When a query is run, it asks each server in the list to return the set of data generated by the SQL command (each server should return roughly total_records/n records, since incremental inserts spread the data evenly). In some cases the data will need to be reduced, and if ordering is requested, then a merge-order operation happens. These add some work on the app side of things, but a pretty minimal amount from what I can see.
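
    For the merge-order step, a simple k-way merge over the per-server result sets (each already sorted by its server) is enough. A sketch, not the actual OtoDB code:

        use strict;
        use warnings;

        # Each argument is an arrayref of rows, sorted by ->{sort_key}.
        sub merge_sorted {
            my @sets = @_;
            my @merged;
            while (grep { @$_ } @sets) {
                # Take the set whose first row has the smallest key.
                my ($best) = sort { $a->[0]{sort_key} cmp $b->[0]{sort_key} }
                             grep { @$_ } @sets;
                push @merged, shift @$best;
            }
            return \@merged;
        }

        my $rows = merge_sorted(
            [ { sort_key => 'Austen' }, { sort_key => 'Melville' } ],  # server 1
            [ { sort_key => 'Dickens' } ],                             # server 2
        );
        print $_->{sort_key}, "\n" for @$rows;   # Austen, Dickens, Melville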

    That's it, coordination in a nutshell.

    A blog among millions.
      If you're dealing with flat, denormalized data spread among several servers, then what are the advantages over other techniques?

      You could work directly with fixed-width data files with fixed-width index files on a clustered file system. This solution lets the file system handle redundancy, distribution across multiple servers, and fault tolerance. The storage portion is already written, and it can be very efficient; you'd just need to write the file handling, data locking, and search routines.
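
      Roughly, the fixed-width handling is just pack/unpack and seek; the field widths below are made up for illustration:

          use strict;
          use warnings;

          my $template = 'A20 A40 A8';   # key, value, timestamp
          my $rec_len  = 20 + 40 + 8;    # every record is 68 bytes

          # Append a record; fixed width means record N sits at N * $rec_len.
          open my $fh, '+>>', 'units.dat' or die "units.dat: $!";
          binmode $fh;
          print {$fh} pack($template, 'title', 'Moby Dick', '20080715');

          # Read record 0 straight from its byte offset.
          seek $fh, 0 * $rec_len, 0;
          read $fh, my $buf, $rec_len;
          my ($k, $v, $ts) = unpack $template, $buf;
          print "$k => $v ($ts)\n";
          close $fh;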

      OpenLDAP allows you to write to one server (with failover to another) and query as many different servers as you want round-robin. Some other LDAP servers allow more than one server to accept writes at a time. If your data is more hierarchical than relational, then using a hierarchical database like a directory service makes sense. Every benchmark I've done or read elsewhere shows OpenLDAP eating the lunch of RDBMS systems on write-seldom, read-often data.
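
      With Net::LDAP, the write-to-one, read-round-robin split might look something like this (host names, base DN, and attributes are invented here):

          use strict;
          use warnings;
          use Net::LDAP;

          my $master   = 'ldap-master.example.com';
          my @replicas = map { "ldap-read$_.example.com" } 1 .. 3;

          # Writes go to the master only.
          my $w = Net::LDAP->new($master) or die "$@";
          $w->bind('cn=admin,dc=example,dc=com', password => 'secret');
          $w->add('uid=jdoe,ou=people,dc=example,dc=com',
                  attrs => [ objectClass => ['inetOrgPerson'],
                             cn => 'J. Doe', sn => 'Doe', uid => 'jdoe' ]);
          $w->unbind;

          # Reads can hit any replica; a real setup would rotate, not rand().
          my $r = Net::LDAP->new($replicas[ int rand @replicas ]) or die "$@";
          my $res = $r->search(base   => 'dc=example,dc=com',
                               filter => '(uid=jdoe)');
          print $_->get_value('cn'), "\n" for $res->entries;
          $r->unbind;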

      If you're using relational databases, why are you querying servers in sequence to see which has the data? A good hashing algorithm for choosing which DB server to query could cut down on quite a bit of traffic.

      Set up three different hash functions for three different data points in your data row. Hash against all three for each piece of data that comes in, and store the row to all three back-end servers it maps to on each write. Then you have three copies of everything, spread evenly among different servers (assuming good hash functions are selected), and you can hash against whichever portion you're querying by and get the data back out of just one server. Replicate the front-end, but don't bother replicating the back-end data stores, because they're already storing in triplicate.

      If a data store server fails, you can reconstruct what it held from the front-end tables and the other data stores pretty easily; in fact, it'd be pretty simple to write a general-case program with DBI to do just that. As you scale up, you must adjust the hash functions to map to more back-end servers and prepopulate those servers with the appropriate data from the existing servers, but I don't see how to balance the storage load onto new servers with your method at all, other than pulling random rows across.
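
      A rough sketch of that triple-hash write/read path; the hash functions and server list are invented for the example:

          use strict;
          use warnings;
          use Digest::MD5 qw(md5);

          my @servers = map { "db$_" } 1 .. 6;

          # Three different hash functions over three data points of a row.
          my @hashers = map {
              my $salt = $_;
              sub { unpack('N', md5($salt . $_[0])) % @servers };
          } qw(A B C);

          my %row = (username => 'jdoe', email => 'jdoe@example.com',
                     city => 'Boise');
          my @points = @row{qw(username email city)};

          # On write: the row goes to all three servers its points map to.
          my @targets = map { $servers[ $hashers[$_]->($points[$_]) ] } 0 .. 2;
          print "write to: @targets\n";

          # On read: hash only the column you're querying by; hit one server.
          my $by_email = $servers[ $hashers[1]->($row{email}) ];
          print "query by email goes to: $by_email\n";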

        You could work directly with fixed-width data files with fixed-width index files on a clustered file system.

        This is true. Or you could use tuple storage, as chromatic pointed out. The thing I like about an RDBMS is that you get sort and filter for free, plus all the other things that come along with this type of data system (mentioned above). OpenLDAP would likely be better for hierarchical data, though I would point out that OtoDB is flexible enough to handle hierarchies as well; it just isn't limited to them.

        If you're using relational databases, why are you querying servers in sequence to see which has the data?

        Of your points, I like this one the best, because this is a problem I see with the design, and I'm still pondering it.

        First, I'll say that my thought all along for reducing network traffic was to couple OtoDB with caching, e.g. memcached. Straightforward and powerful.
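
        Something along these lines, with Cache::Memcached sitting in front of the fan-out (the key scheme and the query_all_servers routine are stand-ins):

            use strict;
            use warnings;
            use Cache::Memcached;

            my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

            sub cached_query {
                my ($sql, @bind) = @_;
                my $key = join '|', 'otodb', $sql, @bind;
                $key =~ s/\s+/_/g;              # memcached keys can't hold spaces

                my $rows = $memd->get($key);
                return $rows if $rows;          # cache hit: skip the n servers

                $rows = query_all_servers($sql, @bind);
                $memd->set($key, $rows, 300);   # cache for five minutes
                return $rows;
            }

            # Stand-in for the real fan-out query described earlier.
            sub query_all_servers { return [] }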

        But it is inefficient to send a SQL command blindly to n servers, especially with a WHERE clause that returns fewer records than there are servers. For queries that return n or more records, I don't see a huge problem. In a lot of cases the number of matching records will easily exceed n, and with incremental inserts it's likely that data will exist on all servers for most queries.

        In my examples above, where you have libraries and books, even a small library is likely to have 1000 books. It's doubtful that you'll have > 1000 servers, and if you did, you would probably have caching anyway.

        But, given the case where you have user profiles and 50 servers, login is highly inefficient, because you have to look on every server until you find the user and check his password. However, it wouldn't be hard to extend OtoDB (or add logic to your app) to simply store, on a single server, the username/password and a pointer to the data unit where the profile is located, reducing your queries from 50 to 2. Update: Or couple OtoDB with a standard RDBMS server for some subset of the data, e.g. user login info.
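
        In code, that lookup might be as simple as the following; the schema and DSNs are invented here:

            use strict;
            use warnings;
            use DBI;

            my @data_dsn = map { "DBI:mysql:database=otodb;host=data$_" } 1 .. 50;
            my $auth = DBI->connect('DBI:mysql:database=auth;host=auth1',
                                    'user', 'pass', { RaiseError => 1 });

            # Query 1: password hash and the index of the server holding
            # the full profile, both from the single auth server.
            my ($pw_hash, $idx) = $auth->selectrow_array(
                'SELECT pw_hash, server_idx FROM logins WHERE username = ?',
                undef, 'jdoe');
            die "no such user\n" unless defined $idx;
            # ... verify $pw_hash against the supplied password here ...

            # Query 2: go straight to the one server with the profile.
            my $dbh     = DBI->connect($data_dsn[$idx], 'user', 'pass',
                                       { RaiseError => 1 });
            my $profile = $dbh->selectall_arrayref(
                'SELECT k, v FROM units WHERE unit = ?', undef, 'profile:jdoe');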

        But really, I just see this as caching, and I'm wondering if it should be part of OtoDB itself, or relegated to something that is already doing it, and would probably do it better. That being said, it still bothers me that in some cases querying each server is overkill. I'm still mulling, and your suggestions have definitely given me some more to think about.

        As for adding servers to an existing set, this wouldn't strictly require rebalancing the data, but in most cases you'd probably want to. This is where using an RDBMS is helpful, because it wouldn't be terribly hard to create some backend processes that understand your data and know how to move some of it to the new server. OtoDB can't do this automatically, however.
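
        A rebalance pass might be little more than a loop like the following, though the every-Nth-row rule and the schema are invented for illustration:

            use strict;
            use warnings;
            use DBI;

            my @old_dsn = map { "DBI:mysql:database=otodb;host=data$_" } 1 .. 3;
            my $new_dbh = DBI->connect('DBI:mysql:database=otodb;host=data4',
                                       'user', 'pass', { RaiseError => 1 });

            for my $dsn (@old_dsn) {
                my $old = DBI->connect($dsn, 'user', 'pass', { RaiseError => 1 });

                # Hand every 4th row to the new server so each of the four
                # ends up holding roughly a quarter of the data.
                my $rows = $old->selectall_arrayref(
                    'SELECT id, k, v FROM units WHERE MOD(id, 4) = 3');

                for my $r (@$rows) {
                    $new_dbh->do('INSERT INTO units (id, k, v) VALUES (?, ?, ?)',
                                 undef, @$r);
                    $old->do('DELETE FROM units WHERE id = ?', undef, $r->[0]);
                }
                $old->disconnect;
            }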

        A blog among millions.