He is using a grand total of 2 wget processes.
True enough. (I had glossed over that part of the script.)
I strongly doubt that using Archive::Rar... is going to be a win
Agreed.
He can save mv processes by using xargs.
Most of his sed filters can be condensed.
This is where I'm doubtful -- maybe xargs can support something like what the OP's script is doing, but frankly I think a simple perl script could do it more cogently, and could replace all the sed filtering as well. Again, one perl process working on a list of thousands of files will be a win over large numbers of mv and sed jobs, even if xargs is helping.
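For instance, here's a rough sketch of what I have in mind (the s/// rules are made-up placeholders, since I don't know exactly what the OP's sed filters do, and the script name below is invented):

#!/usr/bin/perl
# Rough sketch: one perl process renames an arbitrarily long list of files,
# replacing one mv (plus assorted sed calls) per file.
use strict;
use warnings;

while ( my $old = <STDIN> ) {      # e.g. fed by:  find . -type f | perl fixnames.pl
    chomp $old;
    my $new = $old;
    $new =~ s/%20/_/g;             # placeholder for whatever the sed filters do
    $new =~ s/\.HTM$/.html/;       # another made-up rule
    next if $new eq $old;
    rename( $old, $new ) or warn "can't rename '$old': $!\n";
}

That's one process for the whole job, and the substitutions can be as hairy as you like without paying for another sed each time.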
Unless a person is really expert at shell, sed, xargs, etc, while being really new to Perl, I'd think using Perl here would be fruitful and worth the time spent. And once that first step is taken, it may be worthwhile to consider how Perl scripting could provide other optimizations that might be hard to achieve in shell scripting.
Update: I seem to be contradicting tilly's estimates about the overall impact of process management. I'll stand by my position, based partly on the evidence in my "rename vs. mv" test, and partly on other experience I've had (on Solaris, as it happens), where I altered a perl script from doing something like this:
my @cksums;
my @files = `find $path -type f`; # apologies to etcshadow
chomp @files;
push @cksums, `cksum $_` for ( @files );  # one cksum run, in its own subprocess, per file
to doing something like this, which produces the same result:
my @cksums = `find $path -type f | xargs cksum`;  # one pipeline: xargs batches the files into just a few cksum runs
The difference was dramatic. In that case, a lot of the overhead was presumably due to starting lots of shell processes, each one running just one cksum, which probably makes it an "unfair" comparison. Still, it was dramatic.
I decided to retest on my macosx laptop, in a directory that includes lots of software distributions: nearly 12,000 data files, and lots of these are very small -- but not all of them: total space consumed is 5 GB, not 10 (oops -- forgot the "-k" flag on du). To make it less lopsided, I compared these -- in the order shown (in case there was an advantage to going second):
time perl -e '@cksums = `find . -type f -print0 | xargs -0 cksum`'
time perl -e '$/="\x0";
open(I,"find . -type f -print0 |" );
open(SH,"|/bin/sh");
while (<I>) { chomp; print SH "cksum \"$_\" > /dev/null\n" }'
The version with xargs took 7 minutes 5 sec; the version with 12,000 cksum processes run within a single shell (doing slightly more work in perl, but not trying to store the results anywhere) took 14 minutes 13 sec. I'd have to attribute most of the difference to process management issues, and I think there's something missing in tilly's estimates.
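Of course, if the point is to minimize process creation, the logical extreme is to skip the external commands entirely and do the digesting inside perl. Here's a rough sketch of that idea (not something I timed above, and it computes MD5 digests via Digest::MD5 rather than the CRC that cksum produces -- it's just to illustrate the zero-subprocess approach):

#!/usr/bin/perl
# Rough sketch (not benchmarked above): no find, cksum or shell processes
# at all -- one perl process walks the tree and digests each file itself.
# Digest::MD5 is used for illustration; it is not the same checksum as cksum.
use strict;
use warnings;
use File::Find;
use Digest::MD5;

my @cksums;
find( sub {
    return unless -f $_;                  # regular files only, like "find -type f"
    open( my $fh, '<', $_ ) or return;    # skip anything unreadable
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    push @cksums, "$digest  $File::Find::name\n";
}, '.' );
print @cksums;

Whether that actually beats "find | xargs cksum" would depend on how perl's own I/O compares to cksum's, but at least the process count drops to one.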