in reply to RFC: OtoDB and rolling your own scalable datastore

Even after reading your blog I fail to understand the benefits of the denormalized datastore. It seems that you will either fall into the trap of having lots of redundant data lying around (and the nightmare of keeping it all updated), or that you will have to re-invent all the JOIN and "reference" logic at the level of your application rather than having it in the datastore.

Perhaps for some well-defined problems the denormalized datastore is faster and more appropriate than an RDBMS, but in general it seems a poor choice.

Or are we blinded by the ubiquity of the RDBMS model: "if all you have is a hammer, the whole world looks like a nail"?

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James


Re^2: RFC: OtoDB and rolling your own scalable datastore
by arbingersys (Pilgrim) on Jul 15, 2008 at 06:10 UTC

    ... you either will fall into the trap of having lots of redundant data lying around (and the nightmare of keeping them all updated) or that you will have to re-invent all the JOIN and "reference" logic at the level of your application rather than having it in the datastore.

    The data is redundant, certainly, but this falls under the mantra of "disk (and CPU) is cheap so stop worrying about it and scale", which seems to be Google's approach with the App Engine. You are sacrificing space for easy scalability.

    As to re-inventing the JOIN logic at the level of your application, OtoDB::MySQL does not do that (beyond some extra work on ordering, which I cover in the document above). It opts for the redundant data instead.

    Perhaps for some well defined problems the denormalized datastore is faster and more appropriate than a RDBMS, but in general it seems a poor choice.

    Here I agree completely. So do Google and Amazon, from what I can tell. OtoDB::MySQL is not going to replace JOIN-intensive relational applications, such as financial or business-logic systems. But if you were LiveJournal, and realized you don't care about JOINs on a specific set of data (users) but you do have a scaling problem, an OtoDB::MySQL solution makes sense, at least as far as I can see.

    A blog among millions.
      Thanks for your insights.

      However, it is not the space requirements of the redundant data which worry me (with terabyte disks within most people's reach, that is no longer an issue), but rather keeping all that redundant data coordinated. It seems a big task, and one which the RDBMS data model has proven able to solve.

      CountZero


        Because the data used by OtoDB is effectively flat you can stripe it between servers without very much effort. You just have to query that same set of servers to ensure that you are getting all the data back.

        Inserts work incrementally, round-robin style. The library makes an insert, then moves the pointer to the next server in the list, where the next insert is done. Somewhere, you must track the current insert server. In the RubberWiki demo I simply write the current server index to a text file; you could use a database table or a dedicated server instead.
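        To make the pointer mechanics concrete, here is a hypothetical sketch (in Python for brevity; OtoDB itself is Perl, and this is not its actual code). The server list, file name, and function name are all invented for illustration; only the idea of persisting a round-robin index in a text file comes from the post above.

```python
SERVERS = ["db0", "db1", "db2"]        # stand-ins for real server handles
POINTER_FILE = "insert_pointer.txt"    # mirrors the RubberWiki text file

def next_insert_server():
    # Read the index written by the previous insert (0 on first run).
    try:
        with open(POINTER_FILE) as f:
            idx = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        idx = 0
    server = SERVERS[idx]
    # Advance the pointer, wrapping around the server list.
    with open(POINTER_FILE, "w") as f:
        f.write(str((idx + 1) % len(SERVERS)))
    return server
```

        Swapping the text file for a database table or a dedicated pointer server only changes the read/write lines; the round-robin logic stays the same.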

        Queries, updates, and deletes are done against all servers in the list. So instead of sending a SQL command to one RDBMS, you send it to n servers. (Again, it's only this easy because the data structure is simplified; fully relational data is much harder to distribute this way.) OtoDB does this sequentially, which will become a bottleneck at some point, I think. But my other intuition is that all three operations are parallelizable.
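        The fan-out described above can be sketched like this (again an assumed interface, not the real OtoDB API; `FakeServer` stands in for an actual database handle). The loop is sequential, exactly as the post notes, and since each round-trip is independent it could later be parallelized with threads or async I/O.

```python
class FakeServer:
    """Stand-in for a DB connection; real code would use a MySQL driver."""
    def __init__(self, rows):
        self.rows = rows

    def execute(self, sql):
        # Pretend this SQL command returns this server's slice of the data.
        return list(self.rows)

def fan_out(servers, sql):
    # Send the same SQL command to every server in the stripe, one at a time.
    return [s.execute(sql) for s in servers]
```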

        Of these, querying is the only one that requires a re-consolidation of the data. Update and delete just ask the data servers to perform some maintenance, and don't return anything to the user. When a query is run, it asks each server in the list to return the set of data generated by the SQL command (each server should return total_records/n records since incremental inserts should spread data evenly). In some cases the data will need to be reduced, and if ordering is requested, then a merge-order operation happens. These add additional work on the app side of things, but a pretty minimal amount from what I can see.
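        Assuming each server returns its slice already sorted (my reading of the post, not something it states outright), the merge-order step is a classic n-way merge, which is indeed cheap on the app side:

```python
import heapq

def merge_order(per_server_rows):
    # heapq.merge lazily merges n already-sorted iterables into one
    # sorted stream; the app never re-sorts the full result set.
    return list(heapq.merge(*per_server_rows))
```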

        That's it, coordination in a nutshell.

Re^2: RFC: OtoDB and rolling your own scalable datastore
by Pic (Scribe) on Jul 15, 2008 at 15:25 UTC

    I think the argument for denormalisation is something like this: if you have a DB that is mostly read from (and, more importantly, whose rows are rarely updated), with lots of data and many concurrent requests, denormalisation is a way to improve performance by avoiding the overhead of a join. But as you point out, the price is added complexity when updating.

    It's a useful trick, but maybe not as useful as some think. It is, after all, just another way to optimise performance.

      It is a trick used in certain data-mining applications. You denormalize your "live" data into a number of additional tables and run your data-mining queries on these denormalized tables. It is very fast, as there are far fewer JOINs to be done when querying the data, but you are limited in what you can query. If you do this denormalization as a batch job when the database server is not very busy (say, every night or weekend) and put the additional tables on another server optimized for read access, I can see the benefit.
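      A toy illustration of that batch step, with invented table and column names: the foreign key is resolved once, offline, so the read-optimized table can be scanned with no JOIN at query time.

```python
# Normalised "live" data: a users table and an orders table referencing it.
users  = {1: "alice", 2: "bob"}                   # user_id -> name
orders = [(1, "book"), (2, "pen"), (1, "lamp")]   # (user_id, item)

def denormalise(users, orders):
    # Pre-join: materialise each order together with its user's name,
    # trading storage and freshness for JOIN-free reads.
    return [(uid, users[uid], item) for uid, item in orders]
```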

      The drawback is that the data becomes stale as soon as the live server updates, and that not all queries are possible, because you see the data through the mirror of the denormalization process. Sometimes this is even considered a benefit: "Now you can get at your data without having to write SQL." Yeah, sure, what a benefit ...

      CountZero
