I did a first implementation using
DBD::SQLite (3). No it's not just exactly the same as using any other SQL database. I'm quite happy with the result, for the following reasons:
- There's no need to set up a table the hard way. I just declare the filepath column as text. There's no practical limit on how long the string can be that I can store in it.
- Likewise, I can easily add a unique index on the filepath, without specifying a maximum length. Contrast this to Mysql, where you have to specify a "prefix" length. Anything in a column string beyond that length, is simply ignored for the index (!). So, in SQLite, specifying that my filepath must be unique, is really much simpler.
- Best of all, the data store is compact! My SQLite file is smaller than 600k (and even < 500k after a VACUUM), while the largest table in that database that holds data for close to 3000 files, when stored as CSV, is close to 400k... So the overhead is reasonable, especially when compared to the results I would get using another database. For example, in MS-Access, the MDB file is typically many times larger than the data stored as text. I once tried a small sample of 10 rows in DBM::Deep, the text file was just a bit over 1k, and the DBM::Deep file was larger than 150k (!).
- It is fast, query times are typically less than a second, even for more complex queries, like finding duplicates in MD5 digests.
- Views are a fantastic way to store not so trivial queries in the database, and just use them as if they were tables. For example:
CREATE VIEW md5_dups AS
SELECT md5, COUNT(*) AS instances FROM media GROUP BY md5 HAVING COUNT
+(*)>1
Now you can use that in a join with the original table, to get all file paths that have one or more dups:
CREATE VIEW media_dups AS
SELECT media.* FROM media INNER JOIN md5_dups
ON media.md5=md5_dups.md5 ORDER BY media.md5
And you can use it from anywhere.
I know that I wouldn't be so happy if I had just used any other SQL database...