LostShootingStar has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm working in a large distributed file system environment; that is, any given file can exist on any given "node" in the system. The nodes are interconnected on an internal network. The task I'm trying to accomplish is the following: a user supplies a "file list", i.e. a file containing newline-separated filenames, and I need to search every node in the system (sometimes upwards of 64 nodes) for each file in the list. The approach I'm using now is: for each node, I create a thread, then fork a process, exec ssh, and run 'perl' over it, duping the stdin/stdout of the ssh process to the Perl script. Next, I send what I call a "remote perl script" over the ssh connection, followed by "__END__\n". The remote script looks like the following:
sub remote_script {
    my ($mode) = @_;
    if ($mode eq "test") {
        # Source text of the script that runs on the remote node:
        # read one filename per line, glob for it, reply with the result.
        return '
            $| = 1;
            print "READY\n";
            while (<STDIN>) {
                chomp;
                my $found = glob($_);
                print "$found\n";
            }
        ';
    }
}
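For reference, here is a minimal sketch of how such an ssh pipe can be set up with the core IPC::Open2 module; the open_handle used in the code below is assumed to do something along these lines (the READ/WRITE/SENT slot constants are illustrative):

use IPC::Open2;
use constant { READ => 0, WRITE => 1, SENT => 2 };

sub open_handle {
    my ($node, $mode) = @_;
    # Launch perl on the remote node via ssh; its stdin/stdout become
    # our write/read handles.
    my $pid = open2(my $from_node, my $to_node, 'ssh', $node, 'perl');
    # Ship the remote script source. __END__ ends the program text, so
    # the remote perl starts executing it, and its later reads of STDIN
    # see whatever we send next.
    print $to_node remote_script($mode), "\n__END__\n";
    my $banner = <$from_node>;            # wait for the "READY" banner
    return [ $from_node, $to_node, 0 ];   # slots: READ, WRITE, SENT
}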
Once all this is established, I start iterating over the input file (which can be millions and millions of lines) and sending the filenames over the ssh connection to this remote script; it then waits for a response, and when it gets one, it sends the next line. The semi-full code looks like this:
sub process3 {
    my ($start, $end, $node) = @_;
    my %workers;
    my ($cur, $line, $pos);
    my $done = 0;
    my ($rnode, $obj);
    my @file;
    tie @file, 'Tie::File', "inputfile" or die "couldn't tie";

    $workers{$node} = open_handle($node, "glob");  # this opens the ssh connection
    my $to_node   = $workers{$node}[WRITE];
    my $from_node = $workers{$node}[READ];

    # Prime the pipeline with the first filename.
    $workers{$node}[SENT] = $start;
    $line = $file[$workers{$node}[SENT]];
    print $to_node "$line\n";
    $workers{$node}[SENT]++;

    while (1) {
        my $res = $from_node->getline();   # wait for the node's answer
        chomp($res);
        ($obj, $rnode) = split(',', $res);
        print "$obj\n" if $res;
        last if $workers{$node}[SENT] > $end;
        # Send the next filename.
        $line = $file[$workers{$node}[SENT]];
        print $to_node "$line\n" unless $workers{$node}[SENT] > $end;
        $workers{$node}[SENT]++;
    }
}

# i know i can put this in a loop, but i decided to leave it for clarity.
my $thr1  = threads->new(\&process3, 0, $endline, "c001n05");
my $thr2  = threads->new(\&process3, 0, $endline, "c001n06");
my $thr3  = threads->new(\&process3, 0, $endline, "c001n07");
my $thr4  = threads->new(\&process3, 0, $endline, "c001n08");
my $thr5  = threads->new(\&process3, 0, $endline, "c001n09");
my $thr6  = threads->new(\&process3, 0, $endline, "c001n10");
my $thr7  = threads->new(\&process3, 0, $endline, "c001n11");
my $thr8  = threads->new(\&process3, 0, $endline, "c001n12");
my $thr9  = threads->new(\&process3, 0, $endline, "c001n13");
my $thr10 = threads->new(\&process3, 0, $endline, "c001n14");
my $thr11 = threads->new(\&process3, 0, $endline, "c001n15");
my $thr12 = threads->new(\&process3, 0, $endline, "c001n16");

$thr1->join();
$thr2->join();
$thr3->join();
$thr4->join();
$thr5->join();
$thr6->join();
$thr7->join();
$thr8->join();
$thr9->join();
$thr10->join();
$thr11->join();
$thr12->join();
The problem is that this gets really slow with a large input file, taking up to 45 minutes to search for 1,000,000 lines. I'd like to see this improve, even a little bit. If anyone has any advice, please share. Thank you!

Re: Searching a distributed filesystem
by BrowserUk (Patriarch) on Apr 16, 2007 at 04:53 UTC

    Here are a couple of things that may speed up your process.

    1. Make your remote script more intelligent.

      Don't glob for every filename. This is the equivalent of looking things up in an array, except slower because you're going out to the filesystem each time. Use a hash.

      Glob once with a full wildcard and put the results into a hash. Then each time the remote script receives a filename to look up, it does so with an O(1) memory lookup rather than an O(n) filesystem hit (see the sketch after this list).

    2. Don't tie the filelist in every thread.

      I see no advantage to using Tie::File over simply opening the file for input in each thread and reading the filenames one line at a time.

    3. Update: If you need better performance, you could get into opening multiple sessions to each server.

      The messy bit is synchronising access to the file list. If you want to go that route and have difficulty seeing how to synchronise the sessions, come back.
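    A minimal sketch of point 1, assuming the files on a node live under a single base directory (the /data/node path and the blank-line "not found" reply are illustrative assumptions, not part of the original post):

    # Remote script: scan the filesystem once, then answer lookups from memory.
    $| = 1;
    my %index;
    for my $path ( glob('/data/node/*') ) {   # one filesystem scan
        my ($name) = $path =~ m{([^/]+)$};    # key on the bare filename
        $index{$name} = $path;
    }
    print "READY\n";
    while (<STDIN>) {
        chomp;
        # O(1) hash lookup instead of a glob per request.
        print exists $index{$_} ? "$index{$_}\n" : "\n";
    }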

    I don't for the life of me understand your 'clarity' argument for 24 (128!) lines versus 8(16)

    my @threads = map {
        threads->new( \&process3, 0, $endline, $_ )
    } qw[
        c001n05 c001n06 c001n07 c001n08 c001n09 c001n10
        c001n11 c001n12 c001n13 c001n14 c001n15 c001n16
    ];

    $_->join for @threads;

    Clearer to read and much easier to maintain when the server list changes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Searching a distributed filesystem
by GrandFather (Saint) on Apr 16, 2007 at 03:14 UTC

    Not your real problem, I know, and you are aware that it is suboptimal, but:

    my $thr1 = threads->new(\&process3, 0, $endline, "c001n05");
    my $thr2 = threads->new(\&process3, 0, $endline, "c001n06");
    ...
    $thr1->join();
    $thr2->join();
    ...

    just calls out for either a hash or an array:

    my @thr;
    for (1 .. $numThreads) {
        push @thr, threads->new(\&process3, 0, $endline,
                                sprintf("c001n%02d", 4 + $_));
    }
    $_->join() for @thr;

    DWIM is Perl's answer to Gödel
Re: Searching a distributed filesystem
by varian (Chaplain) on Apr 16, 2007 at 06:06 UTC
    Once all this is established, I start iterating over the input file (which can be millions and millions of lines) and sending the filenames over the ssh connection to this remote script; it then waits for a response, and when it gets one, it sends the next line

    You are performing a huge number of round trips of small packets between collaborating processes, and those processes have to wait for each other's input. This is notoriously slow (just try it over a wide-area network to get a feel for the impact of round trips in general). Even when the network itself is fast, the fact that your processes have to wait for round-trip results will slow down the entire operation.

    Therefore your algorithm may benefit greatly from transferring larger chunks over the network in one shot. In particular, I would suggest copying the search list to the nodes (an http/ftp-like protocol, e.g. the Perl module LWP), then executing a fully local search (stdin from the local search-list file, stdout to a local result file on the same node), and only afterwards transferring the result file back in one shot to the central system for reporting.
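    A rough sketch of that flow, using scp/ssh for the transfers rather than LWP (the node names, paths, and the find_files.pl helper are all illustrative assumptions):

    use strict;
    use warnings;

    # One round trip per node: ship the whole list, search locally,
    # pull the results back in one shot.
    for my $node (map { sprintf("c001n%02d", $_) } 5 .. 16) {
        system('scp', 'inputfile', "$node:/tmp/searchlist") == 0
            or die "copy to $node failed";
        # find_files.pl (hypothetical) reads /tmp/searchlist and writes
        # the full path of every file it finds to /tmp/results.
        system('ssh', $node, 'perl', '/tmp/find_files.pl',
               '/tmp/searchlist', '/tmp/results') == 0
            or die "search on $node failed";
        system('scp', "$node:/tmp/results", "results.$node") == 0
            or die "copy from $node failed";
    }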

Re: Searching a distributed filesystem
by graff (Chancellor) on Apr 16, 2007 at 04:05 UTC
    You don't mention what the OS environment is. If it's any sort of unix, have you heard of the 'locate' utility? It has an "updatedb" script that runs at intervals (once a week or whatever) to do a full scan of files visible from a given machine, assuming a multi-file-server (typically NFS) setup with lots of disk volumes on a variety of machines; this builds a big database file with all the file names (full paths) in it. Then the user runs the "locate" command, which is optimized to retrieve all the path strings that match a given substring provided by the user.

    If you don't have the 'locate' tool itself, you need something like that approach: have a stand-alone, regularly scheduled process for building an index of file names and their paths on all available disk volumes.

    If you know you'll always be searching for the volume(s)/path(s) that contain a given file name, you can optimize the retrieval using just the file name as a hash key and storing the path as the data value (multiple paths containing the same file name would need to be "stringified" -- e.g. as a pipe-delimited list).
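    For illustration, a small sketch of an index of that shape built with the core File::Find module (the /data scan root is an assumption):

    use File::Find;

    # Map bare filename => pipe-delimited list of full paths.
    my %where;
    find(sub {
        return unless -f;    # $_ is the basename inside the wanted sub
        $where{$_} = defined $where{$_}
            ? "$where{$_}|$File::Find::name"
            : $File::Find::name;
    }, '/data');

    # A lookup is then a single hash access:
    my @paths = split /\|/, ($where{'some_file.dat'} || '');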

    A given user with a list (of millions?) of file names should just be hitting one resource to look up where those files reside. Distributing a global search across all the file servers (hitting them all simultaneously and repeatedly) is going to sink you -- don't do that. Create a central database that users can query for file location data, and where retrievals for a given query can be indexed and optimized.

    update: At the very least, you should create a consistent database at each node that lists the files currently on that node and is kept up to date on whatever reasonable schedule. Optimizing retrieval from such a database should be pretty simple, so that a querier can ask "is this file on that machine?" and get an efficient answer without a full disk scan. If you're able to do that, it shouldn't be too much of a step to integrate all the node databases into one master, again on some regular schedule. (Apologies if I'm misunderstanding your question.)
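    One way such a per-node database could look, sketched with the core DB_File module (the index path and scan root are assumptions, and duplicate basenames keep only the last path seen here):

    use DB_File;
    use Fcntl;
    use File::Find;

    # Rebuild the node-local index; run this from cron on whatever
    # schedule is reasonable.
    tie my %db, 'DB_File', '/var/local/file_index.db',
        O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "cannot open index: $!";
    %db = ();                                  # start fresh each run
    find(sub { $db{$_} = $File::Find::name if -f }, '/data');
    untie %db;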

      The system already uses a Berkeley DB based system. Unfortunately, the whole point of this project is that the current tools that do the kind of thing you describe break once in a while. The tool I'm working on needs to verify what is ACTUALLY available at the filesystem level, not what "should" be available. What I'm really looking for is a better approach to the overall design/algorithm of my code; I feel it could be accomplished more effectively. At the highest level, I need to send a filename to each node in the system, have the node figure out whether the file exists on that node (using globbing, because we don't always have the full path), and send the full path of the filename back if it's found.
        the tool I'm working on needs to verify what is ACTUALLY available at the filesystem level, not what "should" be available
        So leverage locate, and filter the results?
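        For instance, a sketch that asks locate for candidates and then stats each one to confirm it is really present right now (locate options vary by platform; this is purely illustrative):

        # For each wanted name, get candidate paths from the locate
        # database, then keep only those that still exist on disk.
        open my $list, '<', 'inputfile' or die $!;
        while (my $name = <$list>) {
            chomp $name;
            open my $loc, '-|', 'locate', $name or die $!;
            while (my $path = <$loc>) {
                chomp $path;
                print "$path\n" if -e $path;   # filter out stale entries
            }
            close $loc;
        }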
Re: Searching a distributed filesystem
by shmem (Chancellor) on Apr 16, 2007 at 07:30 UTC
    The problem is that this gets really slow with a large input file, taking up to 45 minutes to search for 1,000,000 lines.
    I'd suspect Tie::File of being the bottleneck; try to work out a solution without it.
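    For example, a sketch of the read loop without Tie::File, reading each thread's slice sequentially (variable names follow the original process3; the surrounding ssh plumbing is unchanged):

    open my $fh, '<', 'inputfile' or die "couldn't open: $!";
    while (my $line = <$fh>) {
        my $idx = $. - 1;                 # $. is the current line number
        next if $idx < $start;            # skip lines before our slice
        last if $idx > $end;              # stop after our slice
        chomp $line;
        print $to_node "$line\n";         # send the filename
        my $res = $from_node->getline();  # wait for the node's reply
        # ... handle $res as before ...
    }
    close $fh;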

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}