in reply to Searching a distributed filesystem
If you don't have the 'locate' tool itself, you need something like its approach: a stand-alone, regularly scheduled process that builds an index of file names and their paths across all available disk volumes.
If you know you'll always be searching for the volume(s)/path(s) that contain a given file name, you can optimize retrieval by using the file name itself as a hash key and storing the path as the data value (multiple paths containing the same file name would need to be "stringified" -- e.g. joined into a pipe-delimited list).
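Off the top of my head, a minimal sketch of such an indexer in Perl, using File::Find and a DB_File hash; the volume list and database location are just placeholders for whatever your site uses:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use File::Find;
use DB_File;

# Placeholder locations -- adjust for your site.
my @volumes = ('/vol1', '/vol2');
my $db_path = '/var/cache/fileindex.db';

unlink $db_path;    # rebuild from scratch on each scheduled run

tie my %index, 'DB_File', $db_path, O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "Cannot open $db_path: $!";

find(sub {
    return unless -f $_;                    # plain files only
    my ($name, $dir) = ($_, $File::Find::dir);
    # Same name under multiple paths: append to a pipe-delimited list.
    $index{$name} = defined $index{$name} ? "$index{$name}|$dir" : $dir;
}, @volumes);

untie %index;
```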
A given user with a list (of millions?) of file names should be hitting a single resource to look up where those files reside. Distributing a global search across all the file servers (hitting them all simultaneously and repeatedly) will sink you -- don't do that. Instead, create a central database that users can query for file location data, one where retrievals for a given query can be indexed and optimized.
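For the central store, something like SQLite via DBI would serve; a rough sketch, where the table and column names are just for illustration:

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=fileindex.sqlite', '', '',
                       { RaiseError => 1 });

# One row per (file name, node, directory); the index on name is what
# makes the "where does this file live?" lookup cheap.
$dbh->do('CREATE TABLE IF NOT EXISTS files (
              name TEXT NOT NULL, node TEXT NOT NULL, dir TEXT NOT NULL)');
$dbh->do('CREATE INDEX IF NOT EXISTS files_name ON files (name)');

my $lookup = $dbh->prepare('SELECT node, dir FROM files WHERE name = ?');
$lookup->execute('some_file.dat');
while (my ($node, $dir) = $lookup->fetchrow_array) {
    print "$node:$dir\n";
}
```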
update: At the very least, you should create a consistent database at each node that lists the files currently on that node and is kept up to date on whatever schedule is reasonable. Optimizing retrieval from such a database should be pretty simple, so that a querier can ask "is this file on that machine?" and get an efficient answer without a full disk scan. If you can do that, it shouldn't be much of a step to integrate all the node databases into one master, again on some regular schedule. (Apologies if I'm misunderstanding your question.)
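Querying the per-node index from the sketch above is then just a hash lookup, no disk scan required:

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;

my $db_path = '/var/cache/fileindex.db';    # same placeholder as above

tie my %index, 'DB_File', $db_path, O_RDONLY, 0644, $DB_HASH
    or die "Cannot open $db_path: $!";

my $name = shift @ARGV or die "usage: $0 filename\n";
if (exists $index{$name}) {
    print "$_\n" for split /\|/, $index{$name};
} else {
    print "$name: not on this node\n";
}

untie %index;
```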
Re^2: Searching a distributed filesystem
by LostShootingStar (Novice) on Apr 16, 2007 at 04:12 UTC
by Anonymous Monk on Apr 16, 2007 at 14:29 UTC