Alternatives to DB for comparable lists

peterrowse has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Although there is much written about how to handle large amounts of data efficiently here already, I find myself for the first time in a few years not being able to find a concrete answer to my specific problem, so I wondered if anyone with experience in these matters might comment.

Its probably a database / not database type of point, in that I need to md5 many files across several machines, compare them and find duplicates. Since complete integrity against bit rot is needed I am using the md5 route rather than file size, date etc characteristics.

So the total number of files which will in the end be md5ed will probably be around 750k, and each machine probably will have up to 250k files in its file system. The file size will range from 30GB for very few files to less than a k for many, with most being in the few MB range. But that is probably not particularly important, whats I am having difficulty with is how I store the md5 sums.

I will need to record somehow around 250k path / filename / size and date / md5sums for each machine, 6 machines in total. Then at a later time I will compare each list and work out a copying strategy which minimises copying time over a slow link but makes sure that if there are any files which are named the same but differ in md5 sum they can be checked manually.

So whether I do this with a database on each machine, with each dataset then copied to my processing machine to be compared, or I do it with another perl tool which implements a simpler type of storage, later copied and processed in the same way, I am wondering. I will want to sort or index the list(s) somehow, probably by both md5 sum and filename I am thinking, to be able to be through in checking for duplicates and bad files. In processing the completed md5 lists I will probably want to read each md5 into a large array, check for duplicates with each read, and then check the other file characteristics if a duplicate is detected as it happens, and create a list of results which I can then act on.

Opinions on which route (DB or any particular module which might suit this application) would be greatly appreciated.

Kind regards, Pete

Comment on Alternatives to DB for comparable lists

Replies are listed 'Best First'.
Re: Alternatives to DB for comparable lists by afoken (Chancellor) on May 15, 2018 at 22:29 UTC
complete integrity against bit rot is needed Consider using a filesystem that guarantees exactly that. ZFS does, when using several disks, and can be considered stable. It also offers deduplication, snapshots, transparent compression, replication across machines, and more. btrfs attempts to do the same, or at least some of the important parts, but I would still call it experimental. If you want a no-brainer setup, find a x86_64 machine with RAM maxed out, add some disks, and install FreeNAS or the fork NAS4Free. It natively uses ZFS and doesn't force you to think much about it. Add a similar second machine and use replication for backup. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply]
Re^2: Alternatives to DB for comparable lists by peterrowse (Acolyte) on May 16, 2018 at 00:10 UTC
I am already a convert - the destination is ZFS, its great. Problem is all this legacy stuff, spanning many years, many times moved / duplicated from disk to disk etc, although its all on EXT4 now, which seems pretty good, but in the past... So at times I am going to need to work out which are the damaged versions manually. But I hope to hone it down to a few files with the rest all matching from version to version (hopefully a few that is, we will see).	[reply]
Re: Alternatives to DB for comparable lists by mxb (Pilgrim) on May 16, 2018 at 10:04 UTC
If I understand correctly, you wish to obtain the following for each file: MD5 hash Source server File path File name Date of collection Where the files are distributed over six servers. This probably depends upon how you are planning to collect all the data, but my personal approach would be to have a small script running on each of the six servers performing the hashing and sending each result back to a common collector. This assumes network connectivity. I think it would be relatively easy to calculate the tuple of the five items for each server with a script and issue them over the network back to a central collection script. Each server can be hashing and issuing results simultaneously to the same collector. While there may be a lot of data to hash, the actual results are going to be small. Therefore, as you know exactly what you are obtaining (the five items of data) I would just go the easiest route and throw them in a table in DBD::SQLite. Then, once you have all the data in your DB, you can perform offline analysis as much as you want, relatively cheaply. As a side note, I'd probably go with SHA-256 rather than MD5 as MD5 collisions are more common, and it's not that much more computationally expensive.	[reply]
Re^2: Alternatives to DB for comparable lists by cavac (Prior) on May 16, 2018 at 11:59 UTC
To add to your answer, i have a similar system running on some of my servers, indexing some pretty nastily-disorganized windows fileshares. I put everything into a PostgreSQL database. That lets me do all kinds of metadata analysis with a few simple SQL statements. Everything "below a few tens of millions of entries" shouldn't be a problem for a decent low- to midrange server build within the last 8 years. My current, 8 year old, development server is used for this kind of crap all the time without any issues. I'm pretty sure that running fstat() on all those files is going to be a major slowdown, and the checksuming certainly needs to be done locally, not over the network. "For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it get's food on the table."	[reply]
Re: Alternatives to DB for comparable lists by Perlbotics (Archbishop) on May 16, 2018 at 18:32 UTC
One approach might be: setup a DB-Server on your collection host run your MD5 tool on each host and depending on your network availability: with networking: contact DB and INSERT the new data on the fly (via internal network or SSH-/VPN-tunnel) w/o networking: output data line by line in a format that your DB supports for batch-loading (store in file for offline transport) run your tasks on the DB Perhaps sending the batch-lines to STDOUT is the easiest approach where the tool could even be invoked by an ssh-command issued on the collection host? That also eliminates the requirement for DB drivers on the host to be scanned. Use a header/trailer or checksum to assert completeness/integrity of the chunk of lines transmitted and perhaps also add some interesting meta-data (creation time, IP, etc.). Update: Oh, you asked for DB-alternatives... Rough estimation: 750k entries with a mean entry size of ca. 500 bytes results in a total size of approx. 375 MB. My experiment with Storable resulted in a file of size 415 MB. Reading/writing took ca. 2.0/3.5s on a moderate PC (3GHz, SSD). Merging and storing all data into a native Perl data structure and using Storable for persistence looks feasible. PRO: fast speed for analytics; CON: no luxury that comes with a DB.	[reply]
Re: Alternatives to DB for comparable lists by peterrowse (Acolyte) on Jun 01, 2018 at 03:28 UTC
Thanks for all the advise, I have gone the DB route - although the storable option was I guess the 'other way' that I was hazily thinking about, I do believe I've used it long in the past, but once I had a bit of a nudge in the DB direction the simplicity and ease for future tweaks won me over. With this type of thing that you're often modifying while you use it, the DB does make it easy to change on a whim, so although I don't quite know what challenges might arise as I work with the data, what I might realise 'ah forgot I might want to do that', if I can avoid calculating all those MD5s (or SHA-256s - I will probably change to that) again, and just update etc in the easiest way, its worth paying a price of slightly reduced performance if there is one. And I didn't know (/ had forgotten possibly, because I've used DBD a lot in the past, but maybe it was only with external DBs) how simple the set up of DBD::SQLite was even on an over stressed laptop. I just took a 12 hour plane trip and coded most of the project during that. Kill two birds with one stone, get a job done and find a way to make a plane journey go a bit faster - I find getting into a bit of code makes time fly. Didn't go the SHA-256 route yet because I didn't have the module installed but will install it for the return trip and hope I have enough work left in the job to keep be occupied for the ride back, because collisions are a concern, even if I can handle them with a last resort diff, they will slow things down because of the low bandwidth between servers. Because access between servers is not consistent, I am going to run the code locally to each server without need for a network, and then once finished or updated transfer the DB files to the processing machine. Some of the machines are quite slow atom types so its best to nice the process and let them do it in their own time, no need for up to the minute results. Then if theres work to do like deleting, making local links (for local duplicates), whatever else I don't know yet, I'll either do that live from the central machine processing script or automatically create local processing scripts. Really though at the moment I am thinking of just consolidating all this data into a single file system that can be kept organised from this point on automatically - ideally through ZFS although I don't know whether it will play nicely with the reliability and speed of the links, I have only used it on single machines so far. Anyways just wanted to say thanks for the help and ideas. Best, Pete	[reply]