in reply to Searching a distributed filesystem

You don't mention what the OS environment is. If it's any flavor of unix, have you heard of the 'locate' utility? Its companion 'updatedb' command runs at intervals (once a week or whatever, typically from cron) and does a full scan of the files visible from a given machine; in a multi-file-server (typically NFS) setup, that means lots of disk volumes on a variety of machines. The scan builds a big database file containing all the file names (full paths). The user then runs the 'locate' command, which is optimized to retrieve all the path strings that match a given substring.
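
For example, calling locate from Perl might look like this (a minimal sketch; it assumes locate is installed and its database is current, and the file name is made up):

    # ask locate for every indexed path containing the substring;
    # qx() runs the command and returns one line per matching path
    chomp(my @paths = qx(locate report.txt));
    print "$_\n" for @paths;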

If you don't have the 'locate' tool itself, you need something like that approach: a stand-alone, regularly scheduled process that builds an index of file names and their paths on all available disk volumes.
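
A minimal sketch of such an indexer in Perl, using File::Find (the volume roots below are hypothetical; substitute your own mount points):

    use strict;
    use warnings;
    use File::Find;

    my @volumes = ('/vol1', '/vol2');   # hypothetical NFS mount points
    my %index;                          # bare file name => list of full paths

    find(sub {
        return unless -f;               # skip directories and other non-files
        push @{ $index{$_} }, $File::Find::name;   # $_ is the bare name here
    }, @volumes);

    # a cron job would run this on schedule and write %index out to disk
    # (e.g. a Berkeley DB file) for the locate-style lookups described above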

If you know you'll always be searching for the volume(s)/path(s) that contain a given file name, you can optimize retrieval by using just the file name as the hash key and storing the path as the data value (multiple paths containing the same file name would need to be "stringified" -- e.g. joined into a pipe-delimited list).
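
A sketch of that storage scheme using Perl's DB_File (Berkeley DB) bindings -- the file and path names are invented for illustration:

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # tie a hash to an on-disk Berkeley DB hash file
    tie my %index, 'DB_File', 'fileindex.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot open fileindex.db: $!";

    # key: bare file name; value: pipe-delimited list of full paths
    my ($name, $path) = ('report.txt', '/vol2/projects/report.txt');
    $index{$name} = exists $index{$name} ? "$index{$name}|$path" : $path;

    untie %index;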

A given user with a list (of millions?) of file names should be hitting just one resource to look up where those files reside. Distributing a global search across all the file servers (hitting them all simultaneously and repeatedly) is going to sink you -- don't do that. Create a central database that users can query for file-location data, where retrievals for a given query can be indexed and optimized.
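
Looking up even millions of names then costs one hash probe each against that single resource. A sketch, reusing the tied %index from above (@wanted_names stands for whatever list the user brings):

    # one indexed lookup per name; no file server is touched at all
    for my $name (@wanted_names) {
        my @locations = exists $index{$name} ? split(/\|/, $index{$name}) : ();
        print @locations ? "$name => @locations\n" : "$name => not indexed\n";
    }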

update: At the very least, you should create a consistent database at each node that lists the files currently on that node and is kept up to date on whatever schedule is reasonable. Optimizing retrieval from such a database should be pretty simple, so that a querier can ask "is this file on that machine?" and get an efficient answer without a full disk scan. If you can do that, it shouldn't be much of a step to integrate all the node databases into one master, again on some regular schedule. (Apologies if I'm misunderstanding your question.)
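
On each node, the query then reduces to something like this (a sketch; %node_index stands in for that node's own tied database):

    our %node_index;   # tied to this node's index file, as in the DB_File sketch

    # answer "is this file on this machine?" from the index,
    # without any disk scan at query time
    sub paths_on_this_node {
        my ($name) = @_;
        return () unless exists $node_index{$name};
        return split /\|/, $node_index{$name};
    }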

Re^2: Searching a distributed filesystem
by LostShootingStar (Novice) on Apr 16, 2007 at 04:12 UTC
    The system already uses a Berkeley DB database. Unfortunately, the whole point of this project is that the current tools that do the kind of thing you describe break once in a while. The tool I'm working on needs to verify what is ACTUALLY available at the filesystem level, not what "should" be available. What I'm really looking for is a better approach to the overall design of my code; I feel it could be accomplished more effectively. At the highest level, I need to send a filename to each node in the system, have the node figure out whether the file exists on that node (using globbing, because we don't always have the full path), and send the full path back if it's found.
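
    For the glob step, one possible node-side sketch (the name and volume roots are placeholders, and this assumes files sit one directory level below each root -- a deeper tree would need File::Find or more patterns):

        use strict;
        use warnings;
        use File::Glob qw(bsd_glob);

        my $name  = 'report.txt';                      # sent in by the querier
        my @roots = ('/export/vol1', '/export/vol2');  # this node's volumes

        # glob for the bare name under each root; -f confirms what is
        # ACTUALLY there right now, not what some index says should be
        my @found = grep { -f } map { bsd_glob("$_/*/$name") } @roots;
        print "$_\n" for @found;    # full paths to send back, if any
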
      The tool I'm working on needs to verify what is ACTUALLY available at the filesystem level, not what "should" be available
      So leverage locate, and filter the results?
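
      Something along these lines, perhaps (a sketch; the file name is a stand-in, and locate's answers are only as fresh as its last updatedb run, hence the -f filter):

          # locate returns fast candidates from its index; the -f test
          # keeps only paths that actually exist on disk right now
          chomp(my @candidates = qx(locate report.txt));
          my @verified = grep { -f } @candidates;
          print "$_\n" for @verified;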