You don't mention what the OS environment is. If it's any sort of Unix, have you heard of the 'locate' utility? It comes with an "updatedb" script that runs at intervals (once a week or whatever) and does a full scan of the files visible from a given machine; in a multi-file-server (typically NFS) setup, that can cover lots of disk volumes on a variety of machines. The scan builds one big database file containing all the file names (full paths). The user then runs the "locate" command, which is optimized to retrieve every path string that matches a substring the user provides.

If you don't have the 'locate' tool itself, you need something like that approach: have a stand-alone, regularly scheduled process for building an index of file names and their paths on all available disk volumes.

If you know you'll always be searching for the volume(s)/path(s) that contain a given file name, you can optimize the retrieval using just the file name as a hash key and storing the path as the data value (multiple paths containing the same file name would need to be "stringified" -- e.g. as a pipe-delimited list).
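To make the hash-key idea concrete, here is a minimal sketch in Python (the same shape works as a Perl hash). The names `build_index` and `lookup` are made up for illustration, and the index is an in-memory dict rather than an on-disk database; a real setup would write it to some persistent key/value store.

```python
import os
from collections import defaultdict


def build_index(roots):
    """Walk each root and map bare file name -> pipe-delimited list of directories."""
    found = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                found[name].append(dirpath)
    # "Stringify" multiple locations so each value fits a simple key/value store.
    return {name: "|".join(sorted(dirs)) for name, dirs in found.items()}


def lookup(index, name):
    """Return the directories that hold `name` (empty list if it is not indexed)."""
    value = index.get(name)
    return value.split("|") if value else []
```

Each lookup is then a single hash fetch plus a split, with no directory scanning at query time. Note that a pipe delimiter assumes no directory name contains "|"; pick a different separator if yours might.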

A given user with a list (of millions?) of file names should be hitting just one resource to look up where those files reside. Distributing a global search across all the file servers (hitting them all simultaneously and repeatedly) is going to sink you -- don't do that. Create a central database that users can query for file location data, and where retrievals for a given query can be indexed and optimized.
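As a rough sketch of what such a central database could look like, the following uses SQLite from Python. The table layout (`files` with `name`, `node`, `path`), the function names, and the chunk size are all assumptions for illustration; any database with an index on the file name would serve.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog; the index on `name` makes lookups cheap."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(name TEXT NOT NULL, node TEXT NOT NULL, path TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS files_by_name ON files (name)")
    return conn


def add_files(conn, rows):
    """Insert (name, node, path) tuples, e.g. from the scheduled scan."""
    conn.executemany("INSERT INTO files (name, node, path) VALUES (?, ?, ?)", rows)
    conn.commit()


def locate_many(conn, names, chunk_size=500):
    """Look up many file names, a chunk at a time, against the one catalog."""
    names = list(names)
    hits = {}
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        marks = ",".join("?" * len(chunk))  # one placeholder per name
        query = (
            "SELECT name, node, path FROM files "
            f"WHERE name IN ({marks}) ORDER BY name, node, path"
        )
        for name, node, path in conn.execute(query, chunk):
            hits.setdefault(name, []).append((node, path))
    return hits
```

A user with a long list of names calls `locate_many` once; the file servers themselves are never touched at query time. Names that are not in the catalog simply do not appear in the result.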

Update: At the very least, you should create a consistent database at each node that lists the files currently on that node and is kept up to date on whatever schedule is reasonable. Optimizing retrieval from such a database should be pretty simple, so that a querier can ask "is this file on that machine?" and get an efficient answer without a full disk scan. If you're able to do that, it shouldn't be too much of a step to integrate all the node databases into one master, again on some regular schedule. (Apologies if I'm misunderstanding your question.)
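The per-node-then-master step might be sketched like this in Python. It assumes each node can hand over a plain list of the full paths it currently holds; the function names and the node names in the usage example are hypothetical.

```python
import os
from collections import defaultdict


def merge_node_listings(node_listings):
    """Combine per-node file listings into one master map.

    node_listings: {node_name: iterable of full paths on that node}
    returns: {file_name: sorted list of (node_name, full_path)}
    """
    master = defaultdict(list)
    for node, paths in node_listings.items():
        for path in paths:
            master[os.path.basename(path)].append((node, path))
    return {name: sorted(hits) for name, hits in master.items()}


def nodes_holding(master, name):
    """Answer "which machines have this file?" from the master map alone."""
    return sorted({node for node, _path in master.get(name, [])})
```

For example, with `listings = {"node1": ["/data/a/report.txt"], "node2": ["/srv/report.txt"]}`, calling `nodes_holding(merge_node_listings(listings), "report.txt")` gives `["node1", "node2"]` without contacting either node.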


In reply to Re: Searching a distributed filesystem by graff
in thread Searching a distributed filesystem by LostShootingStar
