bart has asked for the wisdom of the Perl Monks concerning the following question:

I have a meta-question on what kind of database I should use, in your opinion, to implement a meta-database about a directory tree, storing extra data per file, like author, title, MD5 or SHA digest...

I'm not in favour of an SQL database like SQLite, because of the clumsiness of working with it. Especially the lack of a native, atomic update-or-insert operation is holding me back. I dislike this so much that it's the reason I've postponed actually implementing this for months. I want something simpler. Well, if you know of an extra (fast) layer that would make it easier to work with, I'm still all ears...
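The kind of thin layer I have in mind would be little more than "try an UPDATE, and INSERT if nothing matched". A minimal sketch only, assuming a hypothetical files table keyed by path (none of this exists yet):

    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=meta.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    # Poor man's insert-or-update: UPDATE first, INSERT only if no row matched.
    # Not truly atomic, but with SQLite there is only one writer here anyway.
    sub upsert_file {
        my ($dbh, $path, $mtime, $size, $md5) = @_;
        my $updated = $dbh->do(
            'UPDATE files SET mtime = ?, size = ?, md5 = ? WHERE path = ?',
            undef, $mtime, $size, $md5, $path);
        if ($updated == 0) {    # do() returns "0E0" when no row matched
            $dbh->do('INSERT INTO files (path, mtime, size, md5) VALUES (?, ?, ?, ?)',
                     undef, $path, $mtime, $size, $md5);
        }
    }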

Actually, I like how Perl hashes work: insert-or-update just comes with the territory.

The main problem is updating: I want quick, incremental updates, not a complete rescan, because that could easily take 20 minutes every time.

So, what are the alternatives for Perl? It should be indexed by path, which is, again, a bit impractical for SQL databases. As for updating, adding new files to the database should be fast, which means quickly skipping already existing entries, as well as detecting changes in the existing files (change of modification time and/or size). Ideally, I should be able to detect renames and moves per file, by noticing that a file has gone missing and another one with the same checksum has popped up. Hmm... I can do that.
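To make that incremental-update idea concrete, here is a rough sketch of the comparison I have in mind: snapshot the tree, diff it against what the database already knows, and match orphaned checksums to newly appeared files. The %old snapshot, the column layout and the path are illustrative only:

    use File::Find;
    use Digest::MD5;

    # path => { mtime => ..., size => ..., md5 => ... }, as loaded from the
    # existing meta-database (loading not shown here)
    my %old;

    sub file_md5 {
        my ($path) = @_;
        open my $fh, '<', $path or return;
        binmode $fh;
        return Digest::MD5->new->addfile($fh)->hexdigest;
    }

    # Walk the tree once, collecting size and mtime per file.
    my %new;
    find(sub {
        return unless -f $_;
        my ($size, $mtime) = (stat _)[7, 9];
        $new{$File::Find::name} = { size => $size, mtime => $mtime };
    }, '/some/root');

    # Classify: new paths, changed paths, and paths that disappeared.
    my (@added, @changed);
    for my $path (keys %new) {
        if    (!exists $old{$path})                     { push @added,   $path }
        elsif ($old{$path}{mtime} != $new{$path}{mtime}
            || $old{$path}{size}  != $new{$path}{size}) { push @changed, $path }
    }
    my @missing = grep { !exists $new{$_} } keys %old;

    # Rename/move detection: a missing path whose old checksum reappears
    # under a newly added path.
    my %added_by_md5;
    for my $path (@added) {
        my $md5 = file_md5($path) or next;
        $added_by_md5{$md5} = $path;
    }
    for my $gone (@missing) {
        my $md5 = $old{$gone}{md5} or next;
        print "$gone seems to have moved to $added_by_md5{$md5}\n"
            if exists $added_by_md5{$md5};
    }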

So... What would you do? BerkeleyDB? Anything else?

Update: Actually, an SQL database has another selling point: it's easy to search for (candidate) duplicates by just using a query on MD5 checksum and file size. A feature I'd also like to see in any approach.

Replies are listed 'Best First'.
Re: What kind of database should I use?
by dragonchild (Archbishop) on Feb 19, 2006 at 02:06 UTC
    DBM::Deep seems to be designed for this kind of thing. It's Perl datastructures backed to disk. :-)
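    For the curious, the hash-like interface looks roughly like this; the file name and the per-file fields below are made up for illustration, not taken from bart's setup:

        use DBM::Deep;

        my $db = DBM::Deep->new( 'filemeta.db' );

        # insert-or-update is just hash assignment, keyed by path
        $db->{'/home/bart/music/track01.mp3'} = {
            md5   => 'd41d8cd98f00b204e9800998ecf8427e',
            size  => 4_321_567,
            mtime => 1140307200,
            title => 'Some Track',
        };

        print $db->{'/home/bart/music/track01.mp3'}{md5}, "\n";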

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      It's great as long as you stay below 2GB; beyond that, it silently corrupts your data. Nice module, but it's good to be aware of how you're using it. That only applies to current versions, and the failure isn't intrinsic, so I expect future versions to work just fine.

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

        Do you have a test case? I'm currently working with the maintainer to improve DBM::Deep for another project of mine and I'd like to fix that scenario.

        Have you tried DBM::Deep::set_pack( 8, 'Q' );?


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: What kind of database should I use?
by leocharre (Priest) on Feb 19, 2006 at 06:18 UTC

    Oh crap, man... you bring tears of joy and insanity to my eyes. This is what I lost the first two weeks of this month on.

    I'm making a web app to let people interact with files across the web. Yay! ... Suddenly I feel like I just walked into a party with beautiful people and I forgot to put my pants on. Oh, and I'm covered in excrement.

    So users have to be able to find files, get info on files... and there are a *lot* of freaking files.

    All security issues aside (just because they're out of context for this node)... I went out on an insane chase to rebuild a filesystem, and I had no idea I was doing it.

    I need a MySQL table that stores stuff about files, just like you say: md5, etc... everything.

    An indexer.

    The first thing it did was find every freaking dir and make sure it was in the db table; it would use find to get mod times too. Then it queried the db for every dir record and mod time previously recorded. Then it compared mod times and existence: it deleted all records in the db that were no longer on disk, and inserted all new dirs not yet in the db.

    Second part. So now we have a list of all dirs that are out of sync by mod time, and I do this per directory:

    Get a listing of all regular files and their mod times from the filesystem, and do the same for all records in the db.
    Then compare them: delete db records for files no longer on disk, and insert new records (ones that were not in the db).
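    Roughly, that per-directory pass looked something like the sketch below; the table name and columns are stand-ins, and the directory pass in the first part works the same compare-and-reconcile way:

        use DBI;    # $dbh below comes from DBI->connect elsewhere

        # Reconcile one directory: what's on disk vs. what the db recorded.
        sub sync_dir {
            my ($dbh, $dir) = @_;

            # regular files on disk right now: path => mtime
            my %disk;
            opendir my $dh, $dir or die "opendir $dir: $!";
            for my $entry (readdir $dh) {
                my $path = "$dir/$entry";
                next unless -f $path;
                $disk{$path} = (stat _)[9];
            }
            closedir $dh;

            # what the db thinks this directory contains: path => mtime
            my %db = map { $_->[0] => $_->[1] } @{ $dbh->selectall_arrayref(
                'SELECT path, mtime FROM files WHERE dir = ?', undef, $dir) };

            # delete records for files that are gone from disk
            for my $path (grep { !exists $disk{$_} } keys %db) {
                $dbh->do('DELETE FROM files WHERE path = ?', undef, $path);
            }

            # insert records for files the db has never seen
            for my $path (grep { !exists $db{$_} } keys %disk) {
                $dbh->do('INSERT INTO files (dir, path, mtime) VALUES (?, ?, ?)',
                         undef, $dir, $path, $disk{$path});
            }
        }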

    Since this was a web thing, I don't actually update the db... not exactly. When a user is in a directory, the app checks the filesystem mod time for that dir and compares it to the db mod time for the same dir. If they differ, it runs the indexer and then you see the data. It worked blazing fast for all kinds of junk... md5sum too, which can be time-consuming.
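    In other words, something like this wrapper around the indexer (the dirs table and its columns are invented; sync_dir is the sketch above):

        # Reindex a directory only when its on-disk mtime differs from
        # the one recorded in the db.
        sub maybe_reindex {
            my ($dbh, $dir) = @_;
            my ($db_mtime) = $dbh->selectrow_array(
                'SELECT mtime FROM dirs WHERE path = ?', undef, $dir);
            my $disk_mtime = (stat $dir)[9];
            return if defined $db_mtime && $db_mtime == $disk_mtime;

            sync_dir($dbh, $dir);    # the per-directory pass sketched above
            $dbh->do('REPLACE INTO dirs (path, mtime) VALUES (?, ?)',
                     undef, $dir, $disk_mtime);
        }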

    But... I had a lot of trouble with data corruption indeed...

    I abandoned the whole thing. It's too unreliable to keep up to date. A LOT can go freaking wrong; the data can become really corrupt. The problem is... the more you do to keep it current, the more taxing it is, the more processor power you're asking for... and it sort of defeats the entire purpose!

    I'm really tempted to try it again, only using a db created with the updatedb and locatedb stuff... any ideas?

    How do I import a file database (locatedb) created by updatedb into a MySQL table? Is this something I could do every hour for a repo of 30k files?

    What kind of db is one of these?
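    Something like this might be a starting point for the locatedb import, purely as a sketch (the database name, credentials and the locate_paths table are all made up): rather than parsing the binary locatedb directly, shell out to locate and bulk-insert its output.

        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=fileindex', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 0 });

        # rebuild the path list from scratch on every run
        $dbh->do('DELETE FROM locate_paths');
        my $sth = $dbh->prepare('INSERT INTO locate_paths (path) VALUES (?)');

        # `locate /` lists every entry in the locatedb, since all paths contain a slash
        open my $locate, '-|', 'locate', '/' or die "cannot run locate: $!";
        while (my $path = <$locate>) {
            chomp $path;
            $sth->execute($path);
        }
        close $locate;

        $dbh->commit;    # one transaction keeps a 30k-row import quick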

    Until I'm able to consolidate the filesystem to a db reliably, I've actually been using a mix of output from find, stat, and locate to get my real-time data for the user.

    I should post my code somewhere (it really won't fit here), maybe we're *not* re-inventing the wheel trying to do this after all.

    Update: I also wanted to mention... making sure the data was inserted for about 20k files took my script about 5 minutes. That's with no data in the db, on a first run. As for an update, the startup was slow, about 40 seconds (it had to get ALL the info for 6k dirs on disk and 6k dirs in the db), and then an average update took, on the whole, maybe 1.5 to 3 minutes. Keep in mind, this is a repository with files that are constantly being changed by about 10 people.

Re: What kind of database should I use?
by Corion (Patriarch) on Feb 19, 2006 at 18:10 UTC

    I know that this is not what you want, but I wrote myself a file indexer which uses DBD::SQLite or any other DBI database. Insert-or-update is built in, though not atomic; but as SQLite is my main focus, I know that I'm the only user on it while I hold the connection...

    Insertion of a new item is as fast as the MD5 can be computed (if you enable MD5s), and scanning of the database for stale files (that is, files that don't exist in the real world anymore, or changed their timestamp or size) is at the cost of a stat call per file. That's not ideal, but it's hard to imagine how it could get faster and not fall prey to the complexity that File::Find has regarding file systems.

    The design is very simple - a single table, containing the filename as an absolute file:// URI, and the file size, timestamp and MD5, plus arbitrary other metadata. All metadata goes into that table as columns, so once you add audio files and images, that table becomes denormalized, but that's a small price to pay for very simple queries.
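    Picture something along these lines (a sketch only; the column names here are illustrative, not the module's actual schema):

        use DBI;
        my $dbh = DBI->connect('dbi:SQLite:dbname=index.db', '', '', { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE files (
                uri        TEXT PRIMARY KEY,   -- absolute file:// URI
                size       INTEGER,
                timestamp  INTEGER,
                md5        TEXT,
                -- all other metadata is just more columns in the same table:
                mime_type  TEXT,
                artist     TEXT,
                album      TEXT,
                title      TEXT
            )
        });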

    The synopsis code of the module looks like this:

      use FFRIndexed;
      my $idx = FFRIndexed->new( create_if_needed => 1 );
      $idx->scan('/home/media/mp3/'); # this will rescan all files and stat them
      ...
      my $files = $idx->query( { mime_type => 'audio/mpeg', artist => 'Jamiroquai' });
      printf "%s - %s - %s\n", $_->album, $_->title, $_->uri for @$files;

    In fact, the Table::Denormalized distribution contains a complete "media files browser" application, mostly geared towards interactively browsing a collection of audio files. If anyone wants to look into this closer or even adopt modules out of it, you're very welcome!

      Denormalising data is a bad idea if you ever want meaningful data back out. I would add a table called file_types and reference its primary key from a field added to your main table; for each file type you have, create another table that holds information specific to that type, then just join to the appropriate tables based on the type indicated in the main table. This also gives you a nice way to filter based on file type without having to meddle with extensions, etc.
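      A sketch of that layout, with invented names, just to make the shape concrete:

        use DBI;
        my $dbh = DBI->connect('dbi:SQLite:dbname=meta.db', '', '', { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE file_types (
                type_id INTEGER PRIMARY KEY,
                name    TEXT UNIQUE           -- e.g. 'audio', 'image'
            )
        });

        $dbh->do(q{
            CREATE TABLE files (
                file_id INTEGER PRIMARY KEY,
                uri     TEXT UNIQUE,
                size    INTEGER,
                md5     TEXT,
                type_id INTEGER REFERENCES file_types(type_id)
            )
        });

        $dbh->do(q{
            CREATE TABLE audio_info (         -- one such table per file type
                file_id INTEGER PRIMARY KEY REFERENCES files(file_id),
                artist  TEXT,
                album   TEXT,
                title   TEXT
            )
        });

        # Querying then means joining through the type indicated in the main table:
        my $rows = $dbh->selectall_arrayref(q{
            SELECT f.uri, a.artist, a.album, a.title
            FROM files f
            JOIN file_types t ON t.type_id = f.type_id
            JOIN audio_info a ON a.file_id = f.file_id
            WHERE t.name = 'audio'
        });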

        I'm well aware of how it's "done right", but as I say in the documentation, this is the fifth generation, and the contrast between the normalized queries and the denormalized queries was quite strong: the denormalized table allows me to formulate the queries in a very simple fashion, which was not possible with the normalized data. As the file types get saved in a separate column anyway (mime_type), querying by file type doesn't imply mucking around with the extension either ;-). But most of that isn't obvious from what I said above, nor from the (lacking) documentation ...

Re: What kind of database should I use?
by bart (Canon) on May 14, 2006 at 20:13 UTC
    I did a first implementation using DBD::SQLite (3). No, it's not exactly the same as using just any other SQL database. I'm quite happy with the result, for the following reasons:
    • There's no need to set up a table the hard way. I just declare the filepath column as text; there's no practical limit on the length of the string I can store in it. (A minimal schema along these lines is sketched after this list.)
    • Likewise, I can easily add a unique index on the filepath without specifying a maximum length. Contrast this with MySQL, where you have to specify a "prefix" length: anything in a column string beyond that length is simply ignored for the index (!). So, in SQLite, specifying that my filepath must be unique is really much simpler.
    • Best of all, the data store is compact! My SQLite file is smaller than 600k (and even < 500k after a VACUUM), while the largest table in that database, which holds data for close to 3000 files, comes to nearly 400k when stored as CSV... So the overhead is reasonable, especially compared to what I would get with another database. For example, in MS Access, the MDB file is typically many times larger than the data stored as text. I once tried a small sample of 10 rows in DBM::Deep: the text file was just a bit over 1k, and the DBM::Deep file was larger than 150k (!).
    • It is fast: query times are typically less than a second, even for more complex queries, like finding duplicate MD5 digests.
    • Views are a fantastic way to store not-so-trivial queries in the database and use them as if they were tables. For example:
      CREATE VIEW md5_dups AS
        SELECT md5, COUNT(*) AS instances
        FROM media
        GROUP BY md5
        HAVING COUNT(*) > 1
      Now you can use that in a join with the original table, to get all file paths that have one or more dups:
      CREATE VIEW media_dups AS
        SELECT media.*
        FROM media
        INNER JOIN md5_dups ON media.md5 = md5_dups.md5
        ORDER BY media.md5
      And you can use it from anywhere.
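
    The schema sketch promised above, for the curious. Only the media table name, the filepath column and the md5 column appear in what I wrote; the other columns and the file name are illustrative:

      use DBI;
      my $dbh = DBI->connect('dbi:SQLite:dbname=media.db', '', '', { RaiseError => 1 });

      $dbh->do(q{
          CREATE TABLE media (
              filepath TEXT NOT NULL,   -- declared as plain text, no length limit
              size     INTEGER,
              mtime    INTEGER,
              md5      TEXT
          )
      });

      # a unique index on the full path, no "prefix" length needed as in MySQL
      $dbh->do('CREATE UNIQUE INDEX media_filepath ON media (filepath)');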

    I know that I wouldn't be so happy if I had just used any other SQL database...