Re: randomising file order returned by File::Find

perl script that is run > 100x on a cluster to process 1000s of 3d brain images.

Even if the 1000s become low millions, it would be far more efficient to have a single script that scans the directory hierarchy and builds a big list in memory and then partitions the matching files into 100+ lists (1 per cluster instance) and writes the to separate files. It then initiates the processes on the cluster instances passing the name of one of those files to it. This is simple to implement and avoids the need for locking files entirely.

It could suffer from one problem though, that of imbalanced processing, if there is any great variability in the time taken to process individual images.

If that were the case, I'd opt for a slightly more sophisticated scheme. It have the directory scanning process open a server port that responded to inbound connections by returning the name of the next file to be processed. Each cluster instance then connects, gets the name of a file to process, closes the connection and processes the file; connecting again when it is ready to do another. Again, not a complicated scheme to program, but one that ensures balanced workloads across the cluster, and completely avoids the need for locking or synchronisation.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

Comment on Re: randomising file order returned by File::Find

Replies are listed 'Best First'.
Re^2: randomising file order returned by File::Find by jeffa (Bishop) on Mar 01, 2011 at 19:20 UTC
"... [script] builds a big list in memory and then partitions the matching files into 100+ lists (1 per cluster instance) and writes the to separate files." This is pretty much what Hadoop does for you. jeffa L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat)	[reply] [d/l] [select]
Re^3: randomising file order returned by File::Find by BrowserUk (Patriarch) on Mar 01, 2011 at 22:23 UTC
The downside of that mechanism is control. If, as the OP says later, the need to suspend or terminate the processing early arises, then you're stuck with starting the whole process over from scratch. Same thing if the number of workers varies up or down. With server/clients approach, pause and restart the clients, or knock out half the clients--or double them--and the processing continues without duplication and automatically redistributes to accommodate the changes. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: randomising file order returned by File::Find by jeffa (Bishop) on Mar 01, 2011 at 22:27 UTC
"... then you're stuck with starting the whole process over from scratch." True ... but Hadoop scales linearly, meaning what used to take multiple hours or days to run now only takes a few hours, maybe even a few minutes. Such termination becomes trivial. I do not know how familiar you are with Hadoop/cloud computing. jeffa L-LL-L--L-LL-L--L-LL-L-- -R--R-RR-R--R-RR-R--R-RR B--B--B--B--B--B--B--B-- H---H---H---H---H---H--- (the triplet paradiddle with high-hat)	[reply] [d/l] [select]
Re^5: randomising file order returned by File::Find by BrowserUk (Patriarch) on Mar 01, 2011 at 22:37 UTC
Re^6: randomising file order returned by File::Find by jeffa (Bishop) on Mar 01, 2011 at 22:42 UTC
Some notes below your chosen depth have not been shown here
Re^3: randomising file order returned by File::Find by DrHyde (Prior) on Mar 02, 2011 at 10:31 UTC
Trouble with this is that it doesn't make the best use of your hardware if you have machines that run at different speeds or if some data files take longer to process than others. When I was trying to solve a similar problem (in my case, rendering individual frames of video), using whatever spare cycles were available across a whole bunch of machines (so different amounts of CPU were available on different boxes and at different times) my solution was for the individual renderers to request work units from a master, and rather than just mounting the master's filesystem and hoping for the best, they made a request to my own application. My application was a simple perl script that they accessed over telnet. The script was only working on its local filesystem so locking worked reliably, and simply told each client the filename that it should next work on. The clients then grabbed that file using NFS. That's what I think you should do rather than randomising the list - randomising will reduce the problem, but won't eliminate it. However, if you do want to randomise, then the `wanted` function should build up a list instead of doing any processing on the files. You then shuffle that list, and only after that do you process the files.	[reply] [d/l]