I've been playing around with Google's App Engine recently, and blogging about the datastore a little. It turns out they prefer their data denormalized -- which was kind of hard for me to adjust to at first, I must admit. Amazon's SimpleDB is the same way.

I've gotten used to the idea, however, at least for some kinds of applications. Not only that, it got me to wondering what would happen if you used a standard RDBMS like MySQL and denormalized your data? Would it scale easily? Would its performance be good?

I pursued the idea and came up with OtoDB::MySQL, a Perl module that provides a simple interface for using data across multiple MySQL servers. It forces you to structure your data using one-to-one relationships, and because of this, makes scaling (adding additional data servers) a straightforward proposition. As it currently works, OtoDB::MySQL simply reads and writes data incrementally on each data server in a given list of servers, effectively striping data across n data units.

The "spec" can be downloaded here (pdf, 106k). Pretty much every thought I've had on the subject to date (correct or not) is in this document.

Here's the OtoDB 0.01 source (tar.gz).

use OtoDB::MySQL;

# Logic to iterate through units is the responsibility of the user.
# These structures are used for the round-robin approach in this
# example.
my $curr_unit = 0;
my @units = (
    # unit, connection string, username, password
    ['srv1', 'DBI:mysql:otodb:localhost',  'root', ''],
    ['srv2', 'DBI:mysql:otodb:10.1.6.212', 'root', 'sandwich']
);

# Instantiate the object.
# next_unit() handles iteration over the array of servers above.
# In a 'real-world' example, you might delegate these duties to an
# external server, a global library, etc.
my $m = OtoDB::MySQL->new( units => \&next_unit );

# Create the data models.
# 'fields' are just MySQL column names/types
$m->create(
    model  => 'library',
    fields => {
        libid  => 'int not null',
        name   => 'varchar(30)',
        street => 'varchar(50)',
        city   => 'varchar(30)'
    }
);

# The 'oto_ref' column creates a link to the 'library' model.
# All references are required fields (i.e. 'not null')
$m->create(
    model  => 'book',
    fields => {
        title  => 'varchar(30)',
        author => 'varchar(30)'
    },
    oto_ref => 'library'
);

# Now, we add some data
# $key = [ entity, unit (server), unique ID of record ]
my $key = $m->add(
    entity => 'library',
    data   => {
        # genid() creates a universal 'user' ID
        libid  => $m->genid( @{$units[0]} ),
        name   => 'Library A',
        street => '131 Street',
        city   => 'City One'
    }
);

# Link to 'library' using $key from previous add()
$m->add(
    entity  => 'book',
    data    => { title => 'of mice and men', author => 'steinbeck' },
    oto_ref => $key
);

# Add books to a second library.
# Use the first server as the ID server. Using a single server makes sure
# that user data IDs will be unique across all units.
my $libid = $m->genid( @{$units[0]} );

# Combine add() calls to satisfy 'oto_ref'.
$m->add(
    entity  => 'book',
    data    => { title => 'of mice and men', author => 'steinbeck' },
    oto_ref => $m->add(
        entity => 'library',
        data   => {
            libid  => $libid,   # unique 'user' ID for library B
            name   => 'Library B',
            street => '111 Street',
            city   => 'City the Third'
        }
    )
);

$m->add(
    entity  => 'book',
    data    => { title => 'winter of our discontent', author => 'steinbeck' },
    oto_ref => $m->add(
        entity => 'library',
        data   => {
            libid  => $libid,   # unique 'user' ID for library B
            name   => 'Library B',
            street => '111 Street',
            city   => 'City the Third'
        }
    )
);

# Query data.
# Data is conceptually 'flat', or one-to-one. The 'entity'
# parameter is basically a hint. The '<' means that the entity
# to the right is a sub-entity of the entity to the left. Consider:
#   entity => 'library < book < item_detail'
# Because 'book' is a many to 'library', save processing cycles
# by NOT reducing.
my @ents = $m->get(
    entity => 'library < book',
    fields => 'library.city, library.name, library.street, book.title',
    filter => 'library.name like ?',
    params => ['%two%'],
    order  => 'book.title asc',
    reduce => 'no'   # 'reduce' assumed if omitted
);

# @ents is an array of lightweight objects - OtoDB::MySQL::Entity
foreach my $ent (@ents) {
    print $ent->field('name') . "\t" . $ent->field('city') . "\n";
    print "data:\t" . join('..', $ent->row), "\n";
}

# Set 'reduce' to 'yes' to eliminate duplicates due to data being
# spread through multiple units. Generally, this will only be
# necessary for 'one' entities, i.e. entities that have duplicates.
@ents = $m->get(
    entity => 'library',
    fields => 'library.name, library.street',
    order  => 'library.name desc',
    reduce => 'yes'
);

foreach my $ent (@ents) {
    print $ent->field('name') . "\t" . $ent->field(1) . ":\n";
    print "\tdata:\t", join('..', $ent->row), "\n\n";
}

# 'reduce' is specifically turned off. You will see duplicates from
# each unit.
@ents = $m->get(
    entity => 'library',
    fields => 'library.name, library.street',
    order  => 'library.name asc',
    reduce => 'no'
);

foreach my $ent (@ents) {
    print $ent->field('name') . "\t" . $ent->field(1) . ":\n";
    print "\tdata:\t", join('..', $ent->row), "\n\n";
}

# Duplicating an entity.
# dup() is given a filter expression, and returns the first entity
# it finds in the data store matching that expression. Here we
# add another book based on our 'library ID'.
$m->add(
    entity  => 'book',
    data    => { title => 'grapes of wrath', author => 'steinbeck' },
    oto_ref => $m->dup(
        entity => 'library',
        filter => 'library.libid = ?',
        params => [$libid]   # remember this guy from above?
    )
);

# Update.
# 'fields' is a collection of the column names and the new
# values you want to replace.
$m->update(
    entity => 'library < book',
    fields => {
        # use fully qualified field names
        'library.name' => 'new lib a',
        'book.title'   => 'OFMICEANDMEN'
    },
    filter => 'library.name like ? and book.author like ?',
    params => ['%two%', 'steinbeck']
);

# Deleting.
# Delete based on a filter. Note: OtoDB does a check to see
# if there are any references from or to an entity, if you
# pass a single entity as a param. It dies if any exist.
$m->del(
    entity => 'library < book',
    filter => 'library.name like ?',
    params => ['%one%']
);

# Delete book matching title at a given library
$m->del(
    entity => 'library < book',
    filter => 'library.name like ? and book.title like ?',
    params => ['%lib%', '%winter%']
);

# Increment through units, round-robin. A required subroutine.
# A reference to this routine is passed to the $m object upon
# instantiation, and it's how OtoDB::MySQL gets the next unit for
# data operations.
sub next_unit {
    # get current
    my $n = $curr_unit;

    # increment
    $curr_unit = ++$curr_unit > $#units ? 0 : $curr_unit;

    return $units[$n];
}

I find it hard to get a feel for how an idea really works until I've actually implemented something in it. So I made a rudimentary Wiki application called RubberWiki (included in the source above, see the rubwiki folder), based on OtoDB::MySQL. In my testing it was trivial to move from a single MySQL server to two (one config file edit), and query performance was good against both servers.
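To give a concrete sense of what that one-edit move looked like, here is a sketch of a unit list of the kind RubberWiki reads at startup. The file name and exact layout are my illustration, not necessarily what ships in the rubwiki folder; going from one MySQL server to two is just a matter of adding a second entry.

# units.pl -- hypothetical config file returning the list of data units.
# The application slurps this at startup and iterates over it with a
# round-robin next_unit() routine, as in the example above.
[
    # unit name, DBI connection string, username, password
    ['srv1', 'DBI:mysql:rubwiki:localhost',  'wiki', 'secret'],
    ['srv2', 'DBI:mysql:rubwiki:10.1.6.212', 'wiki', 'secret'],  # the new server
];

# Somewhere in the application (also a sketch):
my @units = @{ do './units.pl' or die "can't read units.pl: $@ $!" };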

I think that querying -- OtoDB::MySQL->get() -- is currently where the bottleneck will appear once the number of data servers grows past a certain point. Some things I think would improve this:

I'd like to hear your thoughts. Thanks.

A blog among millions.

Re: RFC: OtoDB and rolling your own scalable datastore
by samtregar (Abbot) on Jul 14, 2008 at 21:52 UTC
    Not only that, it got me to wondering what would happen if you used a standard RDBMS like MySQL and denormalized your data? Would it scale easily? Would its performance be good?

    Yes and yes. This is pretty much exactly how we built our data-warehouse at a prior gig. You can read about it here:

    http://www.oreillynet.com/pub/a/databases/2007/04/12/building-a-data-warehouse-with-mysql-and-perl.html

    Of course, a data-warehouse isn't really all that similar to what you're proposing; it just happens to also be less-than-normalized. Personally I think these unrelational DBs are solutions in search of a problem - RDBMSes have problems, sure, but I'm not seeing how stripping away so many useful features helps me!

    -sam

      Thanks for the link. Excellent article.

      Did you also stripe your data across multiple servers? Or just do the denormalization?

      A blog among millions.
        No, we didn't do any striping. I think that would be counter-productive for a data-warehouse, but it might be a fun experiment.

        -sam

Re: RFC: OtoDB and rolling your own scalable datastore
by chromatic (Archbishop) on Jul 14, 2008 at 19:47 UTC

    If you're not going to take advantage of the features of an RDBMS, why use an RDBMS? Use a distributed tuple store such as memcached instead.

      I would say because you get more out of an RDBMS than merely a relational data structure. You also get:

      • High level and powerful query language (OtoDB uses this feature rather extensively)
      • Optimized storage and data retrieval engine (data indexes, etc)
      • A well-known paradigm
      • Additional power at the data server level like views, or clusters

      An RDBMS does more than just store and retrieve datasets. It also prepares data based on the SQL commands you send it. Maybe I'm wrong, but it seems that memcached is just a storage mechanism, i.e. your code has to handle sorting/filtering. With an RDBMS you can ask the data servers to do this.

      There are probably some other reasons too that I'm not thinking of. I should also point out that OtoDB does in fact use one relational feature, that of one-to-one relationships.
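      To make the sorting/filtering point concrete, here is a rough sketch of the difference. The memcached key, the shape of the cached value, and the plain 'book' table in the SQL are all made up for the example, not how OtoDB lays things out internally:

      use strict;
      use warnings;
      use Cache::Memcached;
      use DBI;

      # With memcached you get an opaque value back and your code does
      # the filtering and sorting itself.
      my $memd  = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });
      my $books = $memd->get('library:42:books') || [];   # hypothetical key; arrayref of hashrefs
      my @sorted = sort { $a->{title} cmp $b->{title} }
                   grep { $_->{author} eq 'steinbeck' } @$books;

      # With an RDBMS the server filters and sorts before the data ever
      # crosses the wire.
      my $dbh  = DBI->connect('DBI:mysql:otodb:localhost', 'root', '', { RaiseError => 1 });
      my $rows = $dbh->selectall_arrayref(
          'SELECT title FROM book WHERE author = ? ORDER BY title',
          { Slice => {} }, 'steinbeck'
      );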

      A blog among millions.

        You're not going to get 65k queries per second out of an RDBMS if any of those queries have to do JOINs, and you're not going to put Varnish in front of it to speed it up even further. You can do that with memcached and CouchDB, for example.

        I'm by no means suggesting that there's anything wrong with JOIN, or that every persistent data store needs to do what eBay or MySpace or Amazon.com does. Yet by the time your problem is that your database is the bottleneck, slapping something on top of the database doesn't seem like the right approach.

        (Similarly, I have the strong belief that if ActiveRecord could run on something other than an RDBMS, everyone would be a lot happier.)

      One problem with memcached is that it is hard to take a backup of it. I love the product, but I also like my data to be a little safer than that. Using memcached in front of a database gives me the best of both worlds.

      Another reason to use an RDBMS is that while your application may not take advantage of the relational structure, it is still there to be exported to and manipulated on reporting databases.

        Using memcached in front of a database gives me the best of both worlds

        This was kind of how I thought memcached fit best with OtoDB. As the code currently stands, you will only be able to add a certain number of data servers before querying them all incrementally becomes a bottleneck (at least if my intuition doesn't fail me here). memcached is the obvious choice for keeping queries frosty as you add more servers.
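        Here is a rough sketch of what I have in mind: a caching wrapper of my own around get(), not anything that exists in OtoDB today. The key scheme and the five-minute expiry are made up, and it assumes the returned entity objects serialize cleanly through Storable.

        use strict;
        use warnings;
        use Cache::Memcached;
        use Storable qw(freeze);
        use Digest::MD5 qw(md5_hex);

        my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

        # Check the cache first; only fan out to every unit on a miss.
        # $m is the OtoDB::MySQL object from the example above.
        sub cached_get {
            my ($m, %args) = @_;

            local $Storable::canonical = 1;                  # identical queries => identical keys
            my $key  = 'otodb:' . md5_hex( freeze(\%args) );

            my $ents = $memd->get($key);
            return @$ents if $ents;

            my @ents = $m->get(%args);
            $memd->set($key, \@ents, 300);                   # cache the result for 5 minutes
            return @ents;
        }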

        To go along with your other point, an RDBMS also makes it easy to rebalance data, e.g. if you add new servers to an existing set and want to move some of the load over.

        A blog among millions.
Re: RFC: OtoDB and rolling your own scalable datastore
by CountZero (Bishop) on Jul 15, 2008 at 05:39 UTC
    Even after reading your blog I fail to understand the benefits of the denormalized datastore. It seems that you either will fall into the trap of having lots of redundant data lying around (and the nightmare of keeping them all updated) or that you will have to re-invent all the JOIN and "reference" logic at the level of your application rather than having it in the datastore.

    Perhaps for some well defined problems the denormalized datastore is faster and more appropriate than a RDBMS, but in general it seems a poor choice.

    Or are we blinded by the ubiquity of the RDBMS model: "If all you have is a hammer, the whole world looks like a nail"?

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      ... you either will fall into the trap of having lots of redundant data lying around (and the nightmare of keeping them all updated) or that you will have to re-invent all the JOIN and "reference" logic at the level of your application rather than having it in the datastore.

      The data is redundant, certainly, but this falls under the mantra of "disk (and CPU) is cheap so stop worrying about it and scale", which seems to be Google's approach with the App Engine. You are sacrificing space for easy scalability.

      As to re-inventing the JOIN logic at the application level, OtoDB::MySQL does not do that (it does do some extra work on ordering, which I cover in the document above); it opts for the redundant data instead.

      Perhaps for some well defined problems the denormalized datastore is faster and more appropriate than a RDBMS, but in general it seems a poor choice.

      Here I agree completely. So do Google and Amazon, from what I can tell. OtoDB::MySQL is not going to replace an RDBMS for relationally intensive applications, like financial or business logic. But if you were LiveJournal, and realized you don't care about JOIN on a specific set of data (users) but you do have the problem of scaling, an OtoDB::MySQL solution makes sense, at least as far as I can see.

      A blog among millions.
        Thanks for your insights.

        However, it is not the space requirements of the redundant data which worry me (with terabyte disks within most people's reach, that is no longer an issue), but rather keeping all those redundant data coordinated. That seems a big task, and one which the RDBMS data model has already proven it can solve.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I think the argument for denormalisation is something like this: If you have a DB that is mostly read from (and more importantly rows are rarely updated), with lots of data and many concurrent requests, denormalisation is a way to improve performance by avoiding the overhead of a join. But as you point out, the price is added complexity when updating.

      It's a useful trick, but maybe not as useful as some think. It is, after all, just another way to optimise performance.
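      A contrived illustration of that trade-off, using plain DBI against made-up 'library', 'book' and 'book_flat' tables (nothing to do with OtoDB's internals):

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('DBI:mysql:test:localhost', 'root', '', { RaiseError => 1 });

      # Normalized: every read pays for the JOIN.
      my $joined = $dbh->selectall_arrayref(
          'SELECT b.title, l.name
             FROM book b JOIN library l ON b.libid = l.libid
            WHERE l.city = ?',
          { Slice => {} }, 'City One'
      );

      # Denormalized: the library columns are copied onto each book row,
      # so the read is a single-table lookup ...
      my $flat = $dbh->selectall_arrayref(
          'SELECT title, library_name FROM book_flat WHERE library_city = ?',
          { Slice => {} }, 'City One'
      );

      # ... but a change to the library now fans out to every copy.
      $dbh->do('UPDATE book_flat SET library_name = ? WHERE libid = ?',
               undef, 'Library A (renamed)', 42);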

        It is a trick used in certain data-mining applications. You denormalize your "live" data into a number of additional tables and do your data-mining queries on those denormalized tables. It is very fast, as there are far fewer JOINs to be done when querying the data, but you are limited in what you can query. If you do this denormalization as a batch job when the database server is not very busy (say every night or week-end) and put the additional tables on another server optimized for read access, I can see the benefit.

        The drawback is that the data becomes stale as soon as the live server updates and that not all queries are possible because you see the data through the mirror of the denormalization process. Sometimes this is even considered a benefit: "Now you can get at your data without having to write SQL." Yeah sure what a benefit ...
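        For what it is worth, a minimal sketch of such a batch job. Table and server names are invented, and it assumes the live tables are replicated to (or otherwise reachable from) the reporting server; you would run it from cron during a quiet period.

        use strict;
        use warnings;
        use DBI;

        # Hypothetical read-optimized reporting server.
        my $dbh = DBI->connect('DBI:mysql:reporting:10.1.6.250', 'report', 'secret',
                               { RaiseError => 1 });

        # Rebuild the denormalized table from the (replicated) live tables.
        $dbh->do('DROP TABLE IF EXISTS book_report');
        $dbh->do(q{
            CREATE TABLE book_report AS
            SELECT b.title, b.author, l.name AS library_name, l.city AS library_city
              FROM book b
              JOIN library l ON b.libid = l.libid
        });
        $dbh->do('CREATE INDEX idx_city ON book_report (library_city)');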

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James