Moving from SQL to flat-files

punkish has asked for the wisdom of the Perl Monks concerning the following question:

My one-big website wiki/blog/cms application was based on SQLite. It worked well... until I upgraded my database version and the corresponding DBD::SQLite version, but my web host's version numbers lagged. Well, at this change, the db formats became incompatible and everything went up in smoke.

After I had cleaned up the debris, I decided to change things. In my quest for the "simplest thing that would work," and inspired by the database wars on http://radar.oreilly.com/ (too many links to mention here), I decided to convert my datastore to flat-files. Here is my experience in the hope that other monks will offer me suggestions and insights to make this better.

First item on the tackle-list was organization. Aware that too many files in a folder slow things down, and inspired by CPAN's apparent organization, I decided to implement an alphabetical hierarchy like so --

[datastore]/P/PE/PER/PerlMonks
[datastore]/P/PE/PER/PerlTheLanguage
[datastore]/F/FL/FLA/FlatFiles

.. in other words ..

my $FileName;
my ($DIR, $DI, $D);
if (length($FileName) >= 3) {
  $DIR = uc(substr($FileName, 0, 3));
  $DI  = substr($DIR, 0, 2);
  $D   = substr($DI, 0, 1);
}

# in case the FileName is 2 characters long (less than 2 not possible)
elsif (length($FileName) == 2) {
  $DIR = uc($FileName) . '0';
  $DI  = uc($FileName);
  $D   = substr($DI, 0, 1);
}
[download]

A'ite. That works. Now, on to metadata. A file allows only three possible metadata -- the FileName (the title of the web page), the contents of the File (basically, the contents of the web page), and its mtime. What about all manner of other items I had stored in my database table? After pondering a bit, I decided to create a shadow file prefixed with a period. On my Mac, that file is invisible (although it mucks up the landscape on my Windows box), so it is there but not there. I stored all the remaining metadata in this file in the same location as the main file in a key:value pattern

[datastore]/P/PE/PER/.PerlMonks
[datastore]/P/PE/PER/PerlMonks
[datastore]/P/PE/PER/.PerlTheLanguage
[datastore]/P/PE/PER/PerlTheLanguage
[datastore]/F/FL/FLA/.FlatFiles
[datastore]/F/FL/FLA/FlatFiles
..
# and inside a metadata file
   page_created_on: 2005-07-18
   page_created_by: punkish
  page_modified_on: 2005-07-18
  page_modified_by: punkish
      num_of_stars: 3
     privacy_level: 1
         prev_page: PrevPage
         next_page: NextPage
         firstname: Pun
        middlename: 
          lastname: Kish
           company: 
             title: 
          address1: 
          address2: 
              city: 
             state: 
               zip: 
           country: 
            phone1: 
            phone2: 
            phone3: 
          birthday: 
             email: 
               url: punkish.eidesis.org
contact_created_on: 2005-08-21 00:20:00
contact_created_by: 1
               lat: 43.065816
               lon: -89.409005
[download]

Well, the above is working quite well. Some other things are not done yet but they will also get resolved with time and ingenuity. My question for now -- what insights can monks offer me with regards to the above?

Update: Fixed typo in a link above.

--

when small people start casting long shadows, it is time to go to bed

Comment on Moving from SQL to flat-files Select or Download Code

Replies are listed 'Best First'.
Re: Moving from SQL to flat-files by dragonchild (Archbishop) on May 09, 2006 at 15:53 UTC
This is the perfect use for DBM::Deep. And, in 1.00, you'll have ACID transactions. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^2: Moving from SQL to flat-files by jZed (Prior) on May 09, 2006 at 17:21 UTC
I've long wanted to have the time to make a DBD::DbmDeep. It ain't gonna happen. If anyone wants to do this, it would provide DBI/SQL access to DBM::Deep's excellent backend. As mentioned elsewhere in this thread, DBD::DBM would be a good place to start. In fact, it's possible it already works with DBM::Deep (yeah, yeah, it's my code but I've forgotten what it does). If you (dragonchild) or anyone else wants to pursue this, I'd be glad to help.	[reply]
Re^3: Moving from SQL to flat-files by dragonchild (Archbishop) on May 09, 2006 at 18:57 UTC
I'm actually going in the other direction and writing Presto, which is a OODBMS built on top of DBM::Deep. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^4: Moving from SQL to flat-files by jZed (Prior) on May 09, 2006 at 19:43 UTC
Re^5: Moving from SQL to flat-files by dragonchild (Archbishop) on May 09, 2006 at 19:49 UTC
Re^2: Moving from SQL to flat-files by punkish (Priest) on May 09, 2006 at 16:21 UTC
I did look at DBM::Deep. It is a very impressive and simple piece of work. If I were to implement a DBM-based datastore, I would definitely choose DBM::Deep over DB_File and its ilk. There is something seductive about flat-files, however. I can open them in a text editor and edit them without requiring a Perl or web interfacec to get to them. To that end, my question devolves into -- how can I maintain the metadata also in one file? Right now I have the `MainFile` and its shadow `.MainFile`. If I want to change something about the MainFile, I have to open one or the other or both. It would be nice to do it all in one place. I tinkered with something like (contents of a supersized MainFile below) -- `MainFile This is the MainFile. Blah blah blah. __METADATA__ key:value key:value key:value` [download] But I couldn't figure out a efficient and elegant way of reading that in and separating the data from the metadata, returning the metadata in a hash. -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l] [select]
Re^3: Moving from SQL to flat-files by jhourcle (Prior) on May 09, 2006 at 16:46 UTC
Before you come up with yet another metadata syntax, you might want to look at LDIF, which your syntax looks similar to. But as for reading the data / metadata, it's not too hard: `my @data = (); while ( my $line = <IN>) { last if ($line =~ /^__METADATA__$/); push @data, $line; } # assuming no multivalue keys my %metadata = map { chomp $_; split( $_, /:/, 2) } (<IN>);` [download] I'd personally put the metadata before the data -- as there's no chance in the marker being in the metadata section, but there is in the data. I don't know what your usage patterns are, so it might be an additional overhead to be skipping the metadata whenever it's not needed, though: `while (my $line = <IN>) { last if ($line =~ /^__DATA__$/); } my @data = <IN>;` [download] I'd also avoid hashing on titles / names if the data is going to grow significantly, and the title isn't the only indexor. The english language just doesn't have good distribution.	[reply] [d/l] [select]
Re^4: Moving from SQL to flat-files by punkish (Priest) on May 09, 2006 at 17:01 UTC
Re^3: Moving from SQL to flat-files by dragonchild (Archbishop) on May 09, 2006 at 19:05 UTC
The key behind DBM::Deep that your flatfiles will not be able to do is that DBM::Deep is low-memory. Everything happens on disk. Presto (an OODBMS built upon DBM::Deep) will take that even further, providing you with the ORM-like syntax without the object-relational impedance mismatch. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply]
Re^4: Moving from SQL to flat-files by punkish (Priest) on May 09, 2006 at 20:05 UTC
Re^5: Moving from SQL to flat-files by dragonchild (Archbishop) on May 09, 2006 at 20:24 UTC
Some notes below your chosen depth have not been shown here
Re: Moving from SQL to flat-files by jZed (Prior) on May 09, 2006 at 17:15 UTC
Don't know if this helps at all but "SQL" and "flat file" are not mutually exclusive terms. My SQL::Statement supports SQL access to dozens of different kinds of flat files including CSV, DBM (including BerkeleyDB), XML, etc. If you have lots of legacy code using SQL, you could fairly easily use S::S to write a DBD for your metadata format. If you go that route, look at DBD::BDM and DBD::File (both in the DBI distribution) which both have notes on how to make a custom DBD on top of S::S.	[reply]
Re: Moving from SQL to flat-files by blazar (Canon) on May 09, 2006 at 15:18 UTC
Well, the above is working quite well. Some other things are not yet but they will also get resolved with time and ingenuity. My question for now -- what insights can monks offer me with regards to the above? Well I don't have experience with YAML myself, but if you're going to use flat files it may well be worth giving it a try.	[reply]
Re: Moving from SQL to flat-files by perrin (Chancellor) on May 09, 2006 at 19:52 UTC
Will multiple people use this site? Are you prepared to deal with concurrent access, locking, race conditions, etc.? These are the problems that an RDBMS solves for you. I'd encourage you to research some other options before starting from scratch. The road from nothing to scalable multi-user data storage is a long one.	[reply]
Re: Moving from SQL to flat-files by Tanktalus (Canon) on May 10, 2006 at 03:56 UTC
I need to wholeheartedly agree with perrin. Rather than go the route of ripping out the basis upon which scalability is established (your RDBMS), why not merely replace it with one that fits your needs better? At least maintaining your SQL so that you can scale again later is a huge start - jZed's suggestion to use DBD::File or other similar modules that allow you to use plain SQL to access data in otherwise flat files would tide you over allowing you to get back up and running with a minimum of fuss while allowing you to continue investigating a more robust solution (no offense intended, jZed) should the need actually be there. Popular alternatives are mySQL and PostgreSQL, both in the free realm. But I've also heard that MS SQL, Oracle, and DB2 all have free-for-production-use server products available as well, with all the consummerate feature enhancements and full scalability of being able to upgrade (for a price) to their larger-scale offerings should it be needed (including high availability failovers, clustering, etc.). Rather than throwing out the baby with the bathwater, I urge you to reconsider. Using flat files as an intermediary makes sense (especially if it's CSV format which is often importable directly into most other RDBMS formats). But using it for production code is somewhat scary. Personally, I think I'm pretty smart. But I definitely have not taken the time to understand everything that a database offers me. I'm just happy to take advantage of it. In fact, in the production of a website built to track test plan progress, my manager not only told me not to do it, but not to use a database in doing it. (I don't quite remember how the latter point came up given the former.) So I started with flat files. But then I started worrying about concurrency and ... well, then I just went ahead and used a database (DB2, actually). And I didn't have any problems with multiple users logging their tests and overwriting each other. Without spending any time on it (other than learning DB2).	[reply]
Re^2: Moving from SQL to flat-files by punkish (Priest) on May 10, 2006 at 12:58 UTC
Popular alternatives are mySQL and PostgreSQL, both in the free realm. </blockquot> Choice of rdbms was not the issue... I already had a pretty fine engine in SQLite, and it was doing my job well. The immediate issue was portability. There is a certain level of complexity required with maintaining a database. In the case of SQLite, I had to usually upgrade my <cpan://DBD::SQLite> whenever that was upgraded, because it gave to me whatever security, quality, or performance enhancements the underlying SQLite brought. Then one day, bam!, the SQLite format became binary incompatible with earlier versions. I needed the new version because `VACUUM` was giving me an error, and that was fixed in the new version. But, my database suddenly became "different." I fixed it all on my computer, but my hands were tied as far as my web host was concerned. They wouldn't upgrade the driver until it was deemed "stable" for them (understandable, but it boofed me). The other issue often cited is scalability and concurrent use. I am not sure how much of this is just accepted wisdom (folklore?) but somehow it seems folks accept that flat files don't scale well. Concurrent use is definitely an issue, but not for me right now... I am designing a primarily single user application. Even when it becomes multiple user, I am not talking about an Amazon.com here. Can a flat files driven app handle a few thousand hits a day? If yes, I am set. Of course, the "database war" stories on radar.oreilly.com (link in the OP) mention websites with much higher use and content. So, perhaps scalability is myth. Then there is the issue of accessibility (defined my way). With SQL, all the data are hidden in the SQL engine. They are only accessible through an application, be it my web app, or some other interface. With flat files, I can open them with just a text editor. Push comes to shove, I can ssh into my web host, fire up vim, and edit a file, and the changes will show up in my app. The last issue is of a "challenge." I have been quite taken by the "simplest thing that could work" bit (did Ward Cunningham postulate that?). So, I've set out to see if I can redo the app with flat files. Yes, redoing the SQL into non-SQL logic is quite some work, but only becacuse I had trained myself to think the SQL way. SQL engines are very important, and there is a lot dogwork hidden inside of them. But, are there times when they are simply unnecessary? I have actually learned all sorts of neat programming techniques since I started battling converting my app to flat files. Hope this makes sense. But I am also learning from all the different viewpoints I am hearing here. -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l]
Re^3: Moving from SQL to flat-files by perrin (Chancellor) on May 10, 2006 at 20:22 UTC
So you had a bad upgrade experience with SQLite and you're solving it by writing your own database? Sounds like you're throwing the baby with the bathwater to me. Can a flat files driven app handle a few thousand hits a day? Safely? Without corrupting the data or losing updates? Yes, but only if you put lots of work into it. Locking is not easy. You can get something fairly safe if you carefully use exclusive locks, but it won't scale very far before people end up getting blocked when trying to update. Maybe enough for you, maybe not. Of course, the "database war" stories on radar.oreilly.com (link in the OP) mention websites with much higher use and content. So, perhaps scalability is myth. No! Don't draw the wrong conclusion from that series. For one thing, those people are usually talking about something totally different from what you're doing when they say "flat-files." Google doesn't do anything remotely similar to flat-files. They wrote a filesystem! They wrote a distributed, indexed storage system with locking and logging and most everything else RDBMSes have. One guy was using BerkeleyDB and calling it flat-files. Some of them are just storing data that is compiled off-line and never updated by the application, so they have no concurrency issues. Some of them are talking about data that they just read all of and never query against. You can't pull any general ideas about the scalability of flat-files as an RDBMS replacement from that O'Reilly series.	[reply]
Re^3: Moving from SQL to flat-files by Tanktalus (Canon) on May 10, 2006 at 13:33 UTC
What I kind of meant was that these other databases may provide better migration stories. Migration from one version to the next, or migration from one platform to another (although I suspect MS SQL may not be so good at that ;->). As far as SQLite fixing a problem that you couldn't deploy on your server - which is worse: not being able to deploy a known fix, or not being versed enough in ACID transaction compliance and implementation to provide a secure method to ensure no loss of data in your own private modules? Personally, I'd rather stick with SQLite and force my provider to upgrade (or move to a new provider) than try to fight with unknowns which I won't discover until I've already lost data. Part of the scalability of "real" database servers is that you can scale your webserver and your database server independantly. At first, they live on the same box. Then you get too many hits, so you move your database to a second box. If you're using flat files, you can't quite do this since you can't offload your data retrieval/saving to a second box - you need to use that second box as a second webserver, but use NFS for sharing the data. Now things get a bit messy - locking across NFS isn't nearly as safe. As things move on, you get more and more webservers just to handle the traffic. At some point, you may have so much going on that you actually need a second database server. At this point, you'd probably need to nearly double your webservers for flat files. But, because you've offloaded so much of that work to a machine with lots of RAM (for buffering), and fast disks, you only need to duplicate the single machine. And now you need something with clustering capabilities to manage the data. I realise that SQLite isn't a client/server database. But, because you've already built all your logic with SQL, it does provide a great starting point to move to a client/server database system should the need actually arise. Like the rest of the above scalability, you may never need it. But, should you actually need it, you'll be kicking yourself if it's not built in from the ground up. Flat files bring with them so many of the concurrency and data integrity issues that databases have solved that I would simply never want to store data in something that hadn't solved those issues for me.	[reply]
Re: Moving from SQL to flat-files by perrin (Chancellor) on May 10, 2006 at 20:30 UTC
By the way, here's a useful description of some of the locking and searching issues you will encounter with flat-files: http://philip.greenspun.com/panda/databases-choosing	[reply]