in reply to What kind of database should I use?

I know that this is not what you want, but I wrote me a file indexer which uses DBD::SQLite or any other DBI database. Insert-or-update is built-in, but not atomic, but as SQLite is my main focus, I know that I'm the only user on it while I have the connection...

Insertion of a new item is as fast as the MD5 can be computed (if you enable MD5s), and scanning of the database for stale files (that is, files that don't exist in the real world anymore, or changed their timestamp or size) is at the cost of a stat call per file. That's not ideal, but it's hard to imagine how it could get faster and not fall prey to the complexity that File::Find has regarding file systems.

The design is very simple - a single table, containing the filename as an absolute file:// URI, and the file size, timestamp and MD5, plus arbitrary other metadata. All metadata goes into that table as columns, so once you add audio files and images, that table becomes denormalized, but that's a small price to pay for very simple queries.

The synopsis code of the module looks like this:

use FFRIndexed; my $idx = FFRIndexed->new( create_if_needed => 1 ); $idx->scan('/home/media/mp3/'); # this will rescan all files and stat them ... my $files = $idx->query( { mime_type => 'audio/mpeg', artist => 'Jam +iroquai' }); printf "%s - %s - %s\n", $_->album, $_->title, $_->uri for @$files;

In fact, the Table::Denormalized distribution contains a complete "media files browser" application, mostly geared towards interactively browsing a collection of audio files. If anyone wants to look into this closer or even adopt modules out of it, you're very welcome!

Replies are listed 'Best First'.
Re^2: What kind of database should I use?
by Anonymous Monk on Feb 20, 2006 at 02:32 UTC
    Denormalising data is a bad idea if you even want meanful data back out. I would add a table called file_types and use that as a primary key for a field added in your main table, for each file type you have create another table that holds information specific to that type. then just join to the appropriate tables based on the type indicated in the main table. This also give you a nice way to be able to filter based on file type without having to meddle with extensions etc

      I'm well aware of how it's "done right", but as I say in the documentation, this is the fifth generation and the contrast of doing the normalized queries and the denormalized queries was quite strong - the denormalized table allows me to formulate the queries in a very simple fashion, which was not possible with the normalized data. As the file types get saved in a separate column anyway (mime_type), querying by filetype doesn't imply mucking around with the extension either ;-). But most of that isn't obvious from what I said above, nor from the (lacking) documentation ...