Yeah, I'm leaning towards a local DNS cache as well. Thanks.
Currently the pool is a hierarchy of directories like this:
pool/
pool/todo
pool/doing
pool/done
A sample file path is
pool/todo/a/6/a6869c08bcaa2bb6f878de99491efec4f16d0d69
This way readdir() doesn't struggle too much when enumerating a directory's contents; it is trivial to select a random batch of jobs (just generate two random hex digits, 0-f, and read the resulting directory); I get metadata for free from the filesystem; and I can easily keep track of which jobs are in which state and recover from errors.
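Just to make the batch-selection step concrete, here's a rough sketch in Perl (the pool/todo path and the single-hex-digit fan-out are as above; everything else is only one way of doing it, not my actual code):

    #!/usr/bin/perl
    # Rough sketch only: pick a random leaf directory under pool/todo
    # and treat its contents as the batch. Layout as described above;
    # the rest is illustrative.
    use strict;
    use warnings;

    my $pool = 'pool/todo';

    # Two random hex digits (0-f) select one of the 256 leaf directories.
    my ($d1, $d2) = map { sprintf '%x', int rand 16 } 1 .. 2;
    my $dir = "$pool/$d1/$d2";

    opendir my $dh, $dir or die "opendir $dir: $!";
    my @batch = grep { -f "$dir/$_" } readdir $dh;   # the -f test skips . and ..
    closedir $dh;

    print "$dir/$_\n" for @batch;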
I have quite a lot of symmetric bandwidth, but as you say, it's certainly a potential bottleneck. Other than benchmarking and tweaking, are there any good ways to approach this issue?
I'm monitoring the memory pretty closely. I/O is in good shape, and nothing's touching swap. To achieve this with the current architecture I'm limited to about 12-15 concurrent processes -- this is one of the reasons why I want to improve things.
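To illustrate the kind of cap I mean, here's a minimal sketch of limiting the worker count with Parallel::ForkManager (not my actual code; the module and the limit of 15 are just placeholders for whatever mechanism is in use):

    #!/usr/bin/perl
    # Minimal sketch: cap the number of concurrent worker processes.
    # Parallel::ForkManager and the limit of 15 are illustrative only.
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $max_workers = 15;                  # roughly where memory tops out
    my $pm = Parallel::ForkManager->new($max_workers);

    my @jobs = @ARGV;                      # e.g. file paths from the pool above

    for my $job (@jobs) {
        $pm->start and next;               # parent moves on to the next job
        # ... fetch/process $job in the child ...
        $pm->finish;                       # child exits, freeing a slot
    }
    $pm->wait_all_children;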
Does this sound somewhat sensible? :-)