in reply to Parallel::ForkManager and CPU usage?

Since each process is “processing images,” you will probably be able to run quite a few more processes than you have cores, because each process will spend most of its time waiting for disk I/O. The memory footprint won’t be outrageous, either. You should arrange for the number of processes to be adjustable, then fiddle with it to find the “sweet spot” for your system.
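With Parallel::ForkManager, the process cap is a single constructor argument, so making it adjustable is trivial. A minimal sketch, where process_image is a hypothetical stand-in for your real per-file work:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

# Number of simultaneous workers -- take it from the command line so
# you can experiment until you find the sweet spot for your hardware.
my $max_procs = shift(@ARGV) // 4;
my @files     = @ARGV;             # image filenames to process

my $pm = Parallel::ForkManager->new($max_procs);

for my $file (@files) {
    $pm->start and next;           # parent: move on to the next file

    # Child: do the real work here.  process_image() is a hypothetical
    # routine -- substitute your actual image-processing code.
    process_image($file);

    $pm->finish;                   # child exits
}
$pm->wait_all_children;            # block until every worker is done

sub process_image {
    my ($file) = @_;
    # e.g. resize, convert, generate thumbnail ...
}
```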

And, if you are on a Unix/Linux system, don’t forget the -N numprocs parameter of good ol’ xargs. You could write a simple Perl script that expects a filename on the command line and processes just that one file. Then build a file listing all the filenames (or pipe in the output of an ls command) and feed it to xargs. The job gets done, in multi-process style, without writing any complicated Perl code. Maybe just the ticket if this is a “one-off” task?

Re^2: Parallel::ForkManager and CPU usage?
by Jenda (Abbot) on Sep 21, 2014 at 01:56 UTC
    1. Whether the processing will be CPU- or IO-bound depends a lot on the processing. If the processing is complex enough, the IO will be negligible.
    2. If the processes "spend most of their time waiting for disk-IO" and all those images are on the same disk, then starting a lot of processes, all competing for the same disk, is not the best thing to do. Disks nowadays have caches and clever firmware doing read-aheads and other tricks to minimize the need to move the reading heads, but with enough processes reading big enough images you can easily render all that caching ineffective and spend your time waiting for the heads to seek to the next bit of one of the files. The fact that the tasks are IO-bound doesn't necessarily mean you should start many.
    3. If the processing takes long enough, then starting and destroying a new process for each and every image may not matter much, but it might still help to start eight processes and keep them running instead. The easiest solution would be to split the list into eight parts at the start and start a script to process each batch. With thousands of images of fairly random sizes, they should all finish their work at around the same time, give or take a few images.
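    The batch-splitting idea above can be sketched in shell. Note that `split -n r/8` (round-robin split) is a GNU coreutils option, and `process_batch.pl` is a hypothetical script that processes every file named in its argument file:

```shell
# Split the list of filenames into 8 round-robin batches of roughly
# equal size (GNU split), then start one worker per batch.
ls *.jpg > all_files.txt
split -n r/8 all_files.txt batch_

for b in batch_*; do
    perl process_batch.pl "$b" &   # hypothetical per-batch worker
done
wait                               # block until all eight workers finish
```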

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re^2: Parallel::ForkManager and CPU usage?
by karlgoethebier (Abbot) on Sep 20, 2014 at 18:02 UTC
Re^2: Parallel::ForkManager and CPU usage?
by trippledubs (Deacon) on Sep 23, 2014 at 02:09 UTC
    Usage: xargs [-0prtx] [--interactive] [--null] [-d|--delimiter=delim] [-E eof-str] [-e[eof-str]] [--eof[=eof-str]] [-L max-lines] [-l[max-lines]] [--max-lines[=max-lines]] [-I replace-str] [-i[replace-str]] [--replace[=replace-str]] [-n max-args] [--max-args=max-args] [-s max-chars] [--max-chars=max-chars] [-P max-procs] [--max-procs=max-procs] [--show-limits] [--verbose] [--exit] [--no-run-if-empty] [--arg-file=file] [--version] [--help] [command [initial-arguments]]

    I think you mean -P; there is no -N, at least not on Ubuntu or Solaris.

    # ls | xargs -N
    xargs: invalid option -- 'N'

    Had no idea xargs had that many options.
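    For the record, the flag for parallel execution is indeed -P, usually combined with -n 1 so each invocation gets exactly one filename. A sketch, where process_one.pl is a hypothetical script that handles a single file:

```shell
# Run at most 4 processes at a time, one filename per invocation:
ls *.jpg | xargs -n 1 -P 4 perl process_one.pl
```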