in reply to BASH vs Perl performance

As mentioned previously, Perl offers the LWP module that can fetch all 5000 rar files from within a single process. Also, there is an Archive::Rar module (though I haven't used this, and I'm not sure if it would help you). Depending on how intricate the move/rename step is (e.g. are you changing the name of each data file in a large set, as well as putting it into a separate directory?), perl could do this much faster than a shell script -- see my reply above.
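
To illustrate, here is a rough sketch of the fetching step using LWP -- the URL list file and the download directory are invented for the example, not taken from the original script:
use strict;
use warnings;
use LWP::UserAgent;
use File::Basename qw(basename);

# One process fetches the whole list; no per-file process startup cost.
# (Assumes a "downloads" directory already exists.)
my $ua = LWP::UserAgent->new;
open my $urls, '<', 'rar-urls.txt' or die "rar-urls.txt: $!";
while (my $url = <$urls>) {
    chomp $url;
    my $resp = $ua->mirror($url, 'downloads/' . basename($url));
    warn "$url: " . $resp->status_line . "\n"
        unless $resp->is_success or $resp->code == 304;
}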

I believe it's very likely that "a series of sed filters", applied iteratively to thousands of files to alter their contents, would be slower than a single perl script that applies all the filtering over the full list of files in a single process. And the "File::Copy" module might compare quite favorably to "cp" commands in a shell script -- again, depending on how complicated the process is. (On the other hand, a single "rsync" job might be best for this last step.)
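
As a rough sketch of that idea -- the glob pattern, the substitutions, and the target directory are all placeholders, not the OP's actual filters:
use strict;
use warnings;
use File::Copy qw(copy);
use File::Basename qw(basename);

# Placeholder file list -- stands in for the real data set.
my @files = glob 'data/*.dat';

{
    # Apply every filter to every file inside this one process,
    # editing in place instead of spawning a sed pipeline per file.
    local @ARGV = @files;
    local $^I   = '';          # in-place edit, no backup copies
    while (<>) {
        s/foo/bar/g;           # stand-ins for the series of sed filters
        s/[ \t]+$//;
        print;
    }
}

# File::Copy instead of one cp process per file
# (assumes an existing "outgoing" directory).
for my $file (@files) {
    copy($file, 'outgoing/' . basename($file))
        or warn "copy $file failed: $!";
}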

When manipulating files by the thousands, it really makes a difference when you can run just a few distinct processes to do it all, rather than thousands of distinct processes. Also, whenever you can do anything to reduce the total number of "intermediate" files created and destroyed in the overall procedure (e.g. keeping whole archive sets in memory and/or doing in-place edits), you will find this to be worthwhile.
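
As one small illustration of skipping an intermediate file: assuming your unrar supports the "p" (print to stdout) command, an archive member can be streamed straight into the script rather than extracted to disk and read back -- the archive and member names here are invented:
use strict;
use warnings;

# Stream one member of an archive directly into this process instead of
# extracting it to a temporary file first.
# (Assumes "unrar p" prints the member to stdout; -inul suppresses messages.)
my ($archive, $member) = ('set0001.rar', 'data0001.txt');
open my $in, '-|', 'unrar', 'p', '-inul', $archive, $member
    or die "can't run unrar: $!";
while (my $line = <$in>) {
    # ... apply the filtering to $line here ...
}
close $in or warn "unrar exited with status $?";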

Re^2: BASH vs Perl performance
by Aristotle (Chancellor) on Aug 11, 2004 at 03:47 UTC

    Did you look at the actual script?

    He is using a grand total of 2 wget processes. Hardly a reason to switch to LWP.

    I strongly doubt that using Archive::Rar, which has to mediate between C and Perl data structures, is going to be a win over using an external binary for a simple decompression.

    He can save mv processes by using xargs.

    Most of his sed filters can be condensed.

    Granted, a mediocrely written shell script is going to be much slower than a mediocrely written Perl script, but for the tasks it's doing, shell seems like a more than decent tool.

    Makeshifts last the longest.

      He is using a grand total of 2 wget processes.

      True enough. (I had glossed over that part of the script.)

      I strongly doubt that using Archive::Rar... is going to be a win

      Agreed.

      He can save mv processes by using xargs.

      Most of his sed filters can be condensed.

      This is where I'm doubtful -- maybe xargs can support something like what the OP's script is doing, but frankly I think a simple perl script could do it more cogently, and could replace all the sed filtering as well; again, one perl process working on a list of thousands of files will be a win over large numbers of mv and sed jobs, even if xargs is helping.
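
      For example, here is a sketch of how the move/rename step might look in one perl process -- the directories and the renaming rule are made up for the illustration:
      use strict;
      use warnings;

      # One process renames the whole batch -- no mv (or xargs) processes at all.
      # Assumes "sorted/" already exists on the same filesystem as "unpacked/".
      for my $old (glob 'unpacked/*.dat') {
          (my $new = $old) =~ s{^unpacked/}{sorted/};
          rename $old, $new or warn "rename $old -> $new failed: $!";
      }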

      Unless a person is really expert at shell, sed, xargs, etc., while being really new to Perl, I'd think using Perl here would be fruitful and worth the time spent. And once that first step is taken, it may be worthwhile to consider how Perl scripting could provide other optimizations that might be hard to achieve in shell scripting.

      Update: I seem to be contradicting tilly's estimates about the overall impact of process management. I'll stand by my position, based partly on the evidence in my "rename vs. mv" test, and partly on other experience I've had (on Solaris, as it happens), where I altered a perl script from doing something like this:

      my @cksums;
      my @files = `find $path -type f`;   # apologies to etcshadow
      chomp @files;
      push @cksums, `cksum $_` for ( @files );
      to doing something like this, which produces the same result:
      my @cksums = `find $path -type f | xargs cksum`;
      The difference was dramatic. In that case, a lot of the overhead was presumably due to starting lots of shell processes, each one running just one cksum, which probably makes it an "unfair" comparison. Still, it was dramatic.

      I decided to retest on my Mac OS X laptop, in a directory that includes lots of software distributions: nearly 12,000 data files, many of them very small -- but not all of them: total space consumed is 5 GB (not 10 -- oops, forgot the "-k" flag on du). To make it less lopsided, I compared these -- in the order shown (in case there was an advantage to going second):

      time perl -e '@cksums = `find . -type f -print0 | xargs -0 cksum`'
      time perl -e '$/ = "\x0"; open(I, "find . -type f -print0 |"); open(SH, "|/bin/sh"); while (<I>) { chomp; print SH "cksum \"$_\" > /dev/null\n" }'
      The version with xargs took 7 minutes 5 sec; the version with 12,000 cksum processes run within a single shell (doing slightly more work in perl, but not trying to store the results anywhere) took 14 minutes 13 sec. I'd have to attribute most of the difference to process management issues, and I think there's something missing in tilly's estimates.
        Back-of-the-envelope estimates vs. benchmarks. As long as you're benchmarking the right thing, the benchmark is always worth more. ;-)