Re: Alternatives to DB for comparable lists

If I understand correctly, you wish to obtain the following for each file:

MD5 hash
Source server
File path
File name
Date of collection

The files themselves are distributed over six servers.

This probably depends upon how you are planning to collect all the data, but my personal approach would be to have a small script running on each of the six servers performing the hashing and sending each result back to a common collector. This assumes network connectivity.
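To make the per-server script a bit more concrete, here is a minimal sketch of the hashing side. It is written in Python purely for illustration (the same shape works in Perl with Digest::SHA and File::Find). The names `sha256_of` and `file_records` are made up for this example, and it uses SHA-256 rather than MD5, which is my own preference; swap in `hashlib.md5()` if you need MD5.

```python
import hashlib
import os
import socket
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_records(root, server=None):
    """Yield (hash, server, directory, file name, collection date) per file.

    'root' and 'server' are hypothetical parameters for this sketch; the
    server name defaults to the local hostname.
    """
    server = server or socket.gethostname()
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            full_path = os.path.join(directory, name)
            try:
                file_hash = sha256_of(full_path)
            except OSError:
                continue  # unreadable or vanished file: skipped in this sketch
            collected = datetime.now(timezone.utc).isoformat()
            yield (file_hash, server, directory, name, collected)
```

Because `file_records` is a generator, each tuple can be sent off as soon as it is computed rather than after the whole tree has been walked.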

I think it would be relatively easy to calculate the tuple of the five items for each file on each server with a script and send them over the network to a central collection script. All six servers can be hashing and sending results to the same collector simultaneously.
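One way the transport could look, again as a rough Python sketch rather than a recommendation: each agent streams its tuples as JSON lines over a plain TCP connection, and the collector handles every connection in its own thread. The wire format, the class names `RecordHandler` and `Collector`, and the function `send_records` are all assumptions made up for this example.

```python
import json
import queue
import socket
import socketserver


class RecordHandler(socketserver.StreamRequestHandler):
    """Read one JSON-encoded record per line from a connected agent."""

    def handle(self):
        for line in self.rfile:
            line = line.strip()
            if line:
                self.server.records.put(tuple(json.loads(line)))


class Collector(socketserver.ThreadingTCPServer):
    """One thread per connection, so several servers can report at once."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, RecordHandler)
        self.records = queue.Queue()  # thread-safe hand-off to the storage step


def send_records(host, port, records):
    """Agent side: stream records to the collector as JSON lines."""
    with socket.create_connection((host, port)) as conn:
        for record in records:
            conn.sendall(json.dumps(record).encode("utf-8") + b"\n")
```

The collector would be started with something like `Collector(("0.0.0.0", 9000)).serve_forever()` (the port is arbitrary), with a separate loop draining `records` into the database. There is no authentication or encryption here, so this only makes sense on a trusted network.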

While there may be a lot of data to hash, the results themselves are going to be small. Therefore, as you know exactly what you are collecting (the five items of data), I would just go the easiest route and throw them in an SQLite table via DBD::SQLite.
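As a sketch of how small that table can be, here it is using Python's built-in `sqlite3` module (the SQL would be the same through DBI and DBD::SQLite). The table name `files`, the column names, and the choice of primary key are my assumptions, not anything you specified.

```python
import sqlite3

# Hypothetical schema: one row per file per server, indexed by hash
# so that cross-server comparisons stay cheap.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    hash      TEXT NOT NULL,
    server    TEXT NOT NULL,
    path      TEXT NOT NULL,
    name      TEXT NOT NULL,
    collected TEXT NOT NULL,
    PRIMARY KEY (server, path, name)
);
CREATE INDEX IF NOT EXISTS files_by_hash ON files (hash);
"""


def open_store(db_file):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(db_file)
    conn.executescript(SCHEMA)
    return conn


def store(conn, records):
    """Insert a batch of five-item tuples inside a single transaction."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)", records
        )
```

With this key, a re-run over the same server overwrites the earlier row for each file instead of duplicating it. If you want to keep history across collection dates, add `collected` to the primary key.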

Then, once you have all the data in your DB, you can perform offline analysis as much as you want, relatively cheaply.
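For example, the comparisons between servers become a couple of short queries. The sketch below (Python's `sqlite3` again, with a made-up `files` table and three invented sample rows purely for demonstration) finds content that exists on more than one server and content that exists on only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (hash TEXT, server TEXT, path TEXT, name TEXT, collected TEXT)"
)
# Invented sample rows, not real data.
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?)", [
    ("aaa", "srv1", "/data", "report.doc", "2018-05-16"),
    ("aaa", "srv2", "/backup", "report-copy.doc", "2018-05-16"),
    ("bbb", "srv1", "/data", "notes.txt", "2018-05-16"),
])


def shared_hashes(conn):
    """Hashes seen on more than one server, with the server count."""
    return conn.execute("""
        SELECT hash, COUNT(DISTINCT server)
        FROM files
        GROUP BY hash
        HAVING COUNT(DISTINCT server) > 1
        ORDER BY hash
    """).fetchall()


def only_on(conn, server):
    """Files whose content appears on the given server and nowhere else."""
    return conn.execute("""
        SELECT path, name
        FROM files
        WHERE server = ?
          AND hash NOT IN (SELECT hash FROM files WHERE server <> ?)
        ORDER BY path, name
    """, (server, server)).fetchall()
```

Note that matching is by hash, so the renamed copy on `srv2` still counts as the same content as `report.doc` on `srv1`.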

As a side note, I'd probably go with SHA-256 rather than MD5: MD5 collisions can be constructed deliberately, and SHA-256 is not that much more computationally expensive.

Re^2: Alternatives to DB for comparable lists
by cavac (Prior) on May 16, 2018 at 11:59 UTC

To add to your answer, I have a similar system running on some of my servers, indexing some pretty nastily disorganized Windows file shares. I put everything into a PostgreSQL database. That lets me do all kinds of metadata analysis with a few simple SQL statements.

Everything "below a few tens of millions of entries" shouldn't be a problem for a decent low- to midrange server built within the last 8 years. My current, 8-year-old development server is used for this kind of crap all the time without any issues.

I'm pretty sure that running fstat() on all those files is going to be a major slowdown, and the checksumming certainly needs to be done locally, not over the network.

"For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it gets food on the table."
