If you don't have the 'locate' tool itself, you need something like its approach: a stand-alone, regularly scheduled process that builds an index of file names and their paths on all available disk volumes.
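A minimal sketch of such an indexer, assuming the volumes are mounted under paths like /vol1 and /vol2 (adjust to your layout) and that a flat tab-delimited index file is good enough to start with. Run it from cron on whatever schedule keeps the index fresh enough:

    use strict;
    use warnings;
    use File::Find;

    my @volumes = ('/vol1', '/vol2');    # assumed mount points
    open my $idx, '>', '/var/tmp/file_index.txt' or die "open: $!";

    find(
        sub {
            return unless -f $_;                    # index plain files only
            print {$idx} "$_\t$File::Find::dir\n";  # name <TAB> containing dir
        },
        @volumes
    );
    close $idx;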
If you know you'll always be searching for the volume(s)/path(s) that contain a given file name, you can optimize retrieval by using just the file name as a hash key and storing the path as the data value (multiple paths containing the same file name would need to be "stringified" -- e.g. as a pipe-delimited list).
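One way to build that kind of lookup, sketched here with DB_File (any on-disk hash would do); the file names and the index path are just placeholders. Paths for a duplicated file name get appended as a pipe-delimited list:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    tie my %by_name, 'DB_File', '/var/tmp/file_index.db',
        O_RDWR|O_CREAT, 0644, $DB_HASH or die "tie: $!";

    sub add_entry {
        my ($name, $path) = @_;
        $by_name{$name} = exists $by_name{$name}
            ? "$by_name{$name}|$path"      # append another path for this name
            : $path;
    }

    add_entry('report.txt', '/vol1/projects');
    add_entry('report.txt', '/vol2/archive');

    # retrieval: one hash lookup, then split the pipe-delimited list
    my @paths = split /\|/, ($by_name{'report.txt'} // '');
    print "$_\n" for @paths;

    untie %by_name;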
A given user with a list (of millions?) of file names should just be hitting one resource to look up where those files reside. Distributing a global search across all the file servers (hitting them all simultaneously and repeatedly) is going to sink you -- don't do that. Create a central database that users can query for file location data, one where retrievals for a given query can be indexed and optimized.
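A sketch of the kind of central lookup I mean, using DBI with SQLite (the table and column names are just placeholders). The point is that a query for one file name hits an indexed table, not the file servers:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=/var/tmp/file_locations.db',
                           '', '', { RaiseError => 1 });

    $dbh->do(<<'SQL');
    CREATE TABLE IF NOT EXISTS files (
        name TEXT NOT NULL,
        node TEXT NOT NULL,
        path TEXT NOT NULL
    )
    SQL
    $dbh->do('CREATE INDEX IF NOT EXISTS files_name ON files (name)');

    # a querier asks for one name and gets back every node/path that has it
    my $rows = $dbh->selectall_arrayref(
        'SELECT node, path FROM files WHERE name = ?',
        undef, 'report.txt'
    );
    print "$_->[0]:$_->[1]\n" for @$rows;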
Update: At the very least, you should create a consistent database at each node that lists the files currently on that node and is kept up-to-date on whatever schedule is reasonable. Optimizing retrieval from such a database should be pretty simple, so that a querier can ask "is this file on that machine?" and get an efficient answer without a full disk scan. If you're able to do that, it shouldn't be too much of a step to integrate all the node databases into one master, again on some regular schedule. (Apologies if I'm misunderstanding your question.)
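If each node publishes a small SQLite database of its own files, rolling them into the master (the same `files` table as in the sketch above) is mostly an ATTACH plus an INSERT...SELECT, run on the same schedule. The per-node database paths and schema here are assumptions:

    use strict;
    use warnings;
    use DBI;

    my $master = DBI->connect('dbi:SQLite:dbname=/var/tmp/file_locations.db',
                              '', '', { RaiseError => 1 });

    # one small database per node, copied or NFS-mounted somewhere reachable
    my %node_db = (
        node1 => '/var/tmp/node1_files.db',
        node2 => '/var/tmp/node2_files.db',
    );

    for my $node (sort keys %node_db) {
        $master->do('ATTACH DATABASE ' . $master->quote($node_db{$node})
                    . ' AS node_db');
        $master->do('DELETE FROM files WHERE node = ?', undef, $node);
        $master->do(
            'INSERT INTO files (name, node, path)
             SELECT name, ?, path FROM node_db.files',
            undef, $node
        );
        $master->do('DETACH DATABASE node_db');
    }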