Reading between the lines of your post and making a few assumptions, I think I would go for a different architecture than the one you describe for this.
Rather than having multiple, all-in-one fetch-parse-store processes, I'd split the concerns into three: fetchers, parsers, and a monitor to coordinate them.
Unless you need the extras that WWW::Mechanize gives you, I'd use the (much) lighter LWP::Simple::getstore() for the fetchers. One instance per thread, with two or three threads per core feeding off a common Thread::Queue, can easily saturate the bandwidth of most connections up to (say) 100 Mbit/s.
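
Roughly what I have in mind for the fetcher (untested; the thread count, the spool directory and the filename mangling are just placeholders to show the shape of it):

    #!/usr/bin/perl
    ## Queue-fed fetchers: URLs arrive one per line on STDIN and each
    ## page is spooled straight to disk for the parsers to pick up.
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::Simple qw( getstore is_success );

    my $THREADS = 8;                 ## 2-3 per core is a reasonable start
    my $Q = Thread::Queue->new;

    sub fetcher {
        while( defined( my $url = $Q->dequeue ) ) {
            ( my $name = $url ) =~ s{[^\w.-]+}{_}g;    ## crude but serviceable
            my $status = getstore( $url, "spool/$name.html" );
            warn "$url => $status\n" unless is_success( $status );
        }
    }

    my @workers = map threads->create( \&fetcher ), 1 .. $THREADS;

    while( my $url = <STDIN> ) {
        chomp $url;
        $Q->enqueue( $url );
    }
    $Q->enqueue( ( undef ) x $THREADS );   ## one terminator per worker
    $_->join for @workers;
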
The third process, a monitor script, would spawn as many concurrent copies of the parser as are either: a) required to keep up with the inbound data rate; or b) the box can handle, memory- and/or processor-wise.
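
The parser itself could be almost anything that works one file at a time. Purely for illustration — HTML::TokeParser and "pull out the links" are my assumptions here, not a prescription:

    #!/usr/bin/perl
    ## Hypothetical parser: one spooled file per invocation; the print
    ## stands in for whatever the real store step turns out to be.
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $file = shift or die "usage: $0 <spooled html file>\n";
    my $p = HTML::TokeParser->new( $file )
        or die "Can't open $file: $!\n";

    while( my $tag = $p->get_tag( 'a' ) ) {
        my $href = $tag->[1]{ href } or next;
        my $text = $p->get_trimmed_text( '/a' );
        print join( "\t", $file, $href, $text ), "\n";
    }
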
That monitor could be based on threads and system(), or on a module like Parallel::ForkManager, depending upon your OS and preferences.
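
A minimal sketch of the Parallel::ForkManager flavour (parser.pl, the spool layout and the cap of 4 are assumed, carried over from the sketches above):

    #!/usr/bin/perl
    ## Monitor: farm each spooled file out to a parser child, with a
    ## hard cap on how many children run concurrently.
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $MAX = 4;                      ## tune to cores and memory headroom
    my $pm  = Parallel::ForkManager->new( $MAX );

    while( 1 ) {
        for my $file ( glob 'spool/*.html' ) {
            $pm->start and next;      ## parent: queue up the next file
            system( $^X, 'parser.pl', $file ) == 0
                or warn "parser failed on $file: $?\n";
            unlink $file;             ## processed; remove from the spool
            $pm->finish;
        }
        $pm->wait_all_children;
        sleep 1;                      ## then re-poll the spool directory
    }
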
This separation of concerns lets you vary the number of fetcher threads and the number of HTML parsers independently, matching them to the available bandwidth and to the processing power and memory limits of the box(es) this will run on, whilst keeping each of the three components very linear and easy to program.