Re: Alternatives to DB for comparable lists

One approach might be:

1. Set up a DB server on your collection host.
2. Run your MD5 tool on each host and, depending on your network availability:
   - with networking: contact the DB and INSERT the new data on the fly (via internal network or SSH/VPN tunnel)
   - w/o networking: output the data line by line in a format that your DB supports for batch loading (store it in a file for offline transport)
3. Run your tasks on the DB.

Perhaps sending the batch lines to STDOUT is the easiest approach: the tool could then even be invoked by an ssh command issued on the collection host. That also eliminates the need for DB drivers on the hosts to be scanned.
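A minimal sketch of such a scanning tool, assuming a tab-separated line format of MD5, size and path (the format and the sub names md5_of_file, batch_line and scan are my own choices for illustration, not something from the original question). It uses only the core modules File::Find and Digest::MD5, so nothing DB-related is needed on the scanned host:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Return the hex MD5 digest of a file, or undef if it cannot be opened.
sub md5_of_file {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Build one batch line "md5<TAB>size<TAB>path" (assumed format).
# Returns undef for unreadable files.
sub batch_line {
    my ($path) = @_;
    my $md5 = md5_of_file($path);
    return undef unless defined $md5;
    my $size = -s $path;
    return join("\t", $md5, $size, $path) . "\n";
}

# Walk the given directories and print one line per regular file.
sub scan {
    my ($out, @dirs) = @_;
    find({
        no_chdir => 1,          # keep $_ as the full path
        wanted   => sub {
            return unless -f $_;
            my $line = batch_line($_);
            print {$out} $line if defined $line;
        },
    }, @dirs);
}

# Only scan when directories are given on the command line.
scan(\*STDOUT, @ARGV) if @ARGV;
```

From the collection host this could be run as something like `ssh scanhost perl - /some/dir < scan.pl > scanhost.tsv` (host name, directory and file names are placeholders), so the script does not even need to be installed remotely. Note that paths containing tabs or newlines would break this simple format and would need escaping.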

Use a header/trailer or a checksum to assert the completeness/integrity of the transmitted chunk of lines, and perhaps also add some interesting metadata (creation time, IP, etc.).
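One way the header/trailer idea could look, as a rough sketch: the header carries the metadata, the trailer carries the line count and an MD5 over all data lines. The `#BEGIN`/`#END` markers and the sub names frame_chunk and verify_chunk are made up for this example, and the data lines are assumed to be plain bytes, each ending in a newline:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Wrap data lines in a header (metadata) and a trailer (count + checksum).
sub frame_chunk {
    my ($meta, @lines) = @_;    # $meta is a hashref, e.g. { host => ..., time => ... }
    my $body    = join '', @lines;
    my $header  = join("\t", '#BEGIN', map { "$_=$meta->{$_}" } sort keys %$meta) . "\n";
    my $trailer = join("\t", '#END', scalar(@lines), md5_hex($body)) . "\n";
    return $header . $body . $trailer;
}

# Return the data lines if the chunk is complete and intact; die otherwise.
sub verify_chunk {
    my ($chunk) = @_;
    my @all     = split /^/, $chunk;    # split into lines, keeping the newlines
    my $header  = shift @all;
    my $trailer = pop @all;
    die "missing header\n"
        unless defined $header && $header =~ /^#BEGIN\b/;
    die "missing trailer\n"
        unless defined $trailer && $trailer =~ /^#END\t(\d+)\t([0-9a-f]{32})\n\z/;
    my ($count, $sum) = ($1, $2);
    die "line count mismatch\n" unless $count == @all;
    die "checksum mismatch\n"   unless $sum eq md5_hex(join '', @all);
    return @all;
}
```

A truncated transfer loses the trailer and a corrupted one fails the checksum, so the collection host can reject the chunk before anything is loaded.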

Update:

Oh, you asked for DB alternatives... Rough estimate: 750k entries with a mean entry size of ca. 500 bytes give a total size of approx. 375 MB. My experiment with Storable produced a file of 415 MB. Reading/writing took ca. 2.0/3.5 s on a moderate PC (3 GHz, SSD).

Merging all data into a native Perl data structure and using Storable for persistence looks feasible. PRO: fast analytics; CON: none of the luxury that comes with a DB.
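A small sketch of what the Storable variant might look like. The hash layout (MD5, then host, then list of paths), the tab-separated input format "md5, size, path" and the sub names are assumptions for illustration only; this is not the code used for the timing experiment above:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);

# Merge "md5<TAB>size<TAB>path" lines from one host into the index,
# laid out as $index->{$md5}{$host} = [ paths ... ].
sub merge_lines {
    my ($index, $host, @lines) = @_;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my ($md5, $size, $path) = split /\t/, $copy, 3;
        next unless defined $path;      # skip malformed lines
        push @{ $index->{$md5}{$host} }, $path;
    }
    return $index;
}

# Example analytic: MD5 sums that occur on more than one host.
sub shared_md5s {
    my ($index) = @_;
    return sort grep { scalar(keys %{ $index->{$_} }) > 1 } keys %$index;
}

# Persist and reload the whole structure (nstore writes in network byte order).
sub save_index { my ($index, $file) = @_; return nstore($index, $file); }
sub load_index { my ($file) = @_;         return retrieve($file); }
```

Everything lives in memory, so a query is just a hash lookup or a grep over the keys; the price is that the whole file has to be read back in before the first query.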
