in reply to BASH vs Perl performance

Since Perl merely acts as a front end to mv and other command line UNIX commands...
Not exactly -- Perl calls the underlying rename system call directly, so there's some efficiency gained in not having to fork and exec. Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time.
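
To make that concrete, here is a minimal sketch of the two approaches in perl (the file names are made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up file names, just to illustrate the two approaches.
    my ($old, $new) = ('incoming/file.dat', 'processed/file.dat');

    # Built-in: a single rename(2) system call, no fork/exec.
    # (Note that the builtin cannot move a file across filesystems.)
    rename $old, $new or die "rename failed: $!";

    # Shelling out: a separate mv process is forked and exec'd per file.
    system('mv', $new, $old) == 0 or die "mv exited with status $?";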

If you have time on your hands and are good at Perl, the only way to be sure is to write the program and measure. But I'm guessing it won't make a big difference.

Re^2: BASH vs Perl performance
by graff (Chancellor) on Aug 11, 2004 at 02:59 UTC
    Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time.

    I think that would be worth testing. When you're talking about hundreds of thousands of 'mv $a $b' in a shell vs. the equivalent number of 'rename $a, $b' in a perl script, the time (and overall cpu resources) saved by the latter could be well worth the time it takes to write the perl script -- especially if the process is going to be repeated at regular intervals.

    I have a handy "shell-loop" tool written in perl (posted here: shloop -- execute shell command on a list) which makes it easy to test this, using the standard "time" utility. I happened to have a set of 23 directories ("20*") holding a total of 3782 files, so I created a second set of 23 empty directories ("to*"), and tried the following:

    # rename files from 20* -> to*:
    $ find 20* -type f | time shloop -e rename -s ^20:to
           10.62 real         0.39 user         0.27 sys
    # now "mv" them back:
    $ find to* -type f | time shloop -e mv -s ^to:20
           18.99 real         0.96 user         6.93 sys
    This is on a standard intel desktop box running FreeBSD; I expect the results would be comparable (or more dramatic) on other unix flavors.

    The first case uses the perl-internal "rename" function call to relocate each of the 3782 files to a new directory; in the latter case, shloop opens a shell process ( open(SH, "|/bin/sh")) and prints 3782 successive "mv" commands to that shell. An interesting point to make here is that the first case also had the extra overhead of "growing" the new directories as files were added to them for the first time, whereas the target directories for the "mv" run were already big enough -- but the "mv" run still took almost twice as long (probably because of the overhead involved in creating and destroying all those sub-processes).
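
    In outline, the two code paths look something like this (a simplified sketch, not shloop's actual source -- file names arrive on stdin from the find commands above, and the target name comes from the same ^20 -> to substitution):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Simplified sketch of the two code paths compared above.
        my $mode = shift || 'rename';

        if ($mode eq 'rename') {
            # One perl process does everything via the rename(2) syscall.
            while (my $from = <STDIN>) {
                chomp $from;
                (my $to = $from) =~ s/^20/to/;
                rename $from, $to or warn "rename $from: $!";
            }
        }
        else {
            # Still one perl process, but every line costs the shell child
            # a fork/exec of a separate mv process.
            open(SH, "|/bin/sh") or die "can't start shell: $!";
            while (my $from = <STDIN>) {
                chomp $from;
                (my $to = $from) =~ s/^20/to/;
                print SH "mv '$from' '$to'\n";
            }
            close SH;
        }

    Either way only one perl process runs; the difference is the 3782 mv processes the shell has to spawn in the second branch.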

    This is a bit of a "nonsense" example, of course. Presumably, a shell command like "mv foo/* bar/" (or 23 of them, to handle the above example) would be really quick, because lots of files are moved in a single (compiled) process. But I wrote shloop to handle cases where each individual file needed a name-change as well (e.g. rename "foo/*.bar" to "fub/*.bub"). For this sort of case, a pure shell loop approach has to do something like  o=`echo $i | sed 's/foo\(.*\)bar/fub\1bub/'`; mv $i $o  on every iteration, which spawns several extra processes per file and would take much longer than the "mv" example shown above.
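
    For that per-file case, the whole loop can stay inside one perl process (a minimal sketch using the same made-up foo/bar/fub/bub names):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Rename foo/*.bar to fub/*.bub without spawning echo, sed or mv
        # for each file (names are the made-up ones from the example above).
        for my $old (glob 'foo/*.bar') {
            (my $new = $old) =~ s{^foo/(.*)\.bar$}{fub/$1.bub};
            rename $old, $new or warn "can't rename $old: $!";
        }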

    So the moral is: don't underestimate the load imposed by standard shell utilities -- they don't actually scale all that well when invoked in large quantities.

      You are missing the point.

      So you saved 6.5 seconds. Let's assume that as the actual script is more complex and the number of files is larger, you'd manage to shave 30x as much off of its runtime. That's 195 seconds, a little over three minutes. Your code probably does that job 60x or maybe 100x faster than the original script.

      These numbers, by any standard, are impressive.

      Unfortunately, they kind of pale in comparison to the 2-hour runtime the script currently takes…

      Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly.

      But if you can arrange for four parallel downloads (and one doesn't have to go to Perl for that -- job control is almost the shell's raison d'être), even considering all the other work the script has to do, the runtime would drop to something over half an hour. Maybe 45 minutes.
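
      Just to show the shape of it in perl terms (a rough sketch -- "fetch_one" is a stand-in for whatever actually grabs a single rar file, and as said above, backgrounding jobs and wait in the shell gets you the same effect):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Rough sketch: keep up to four downloads running at once.
          # "fetch_one" is a made-up name for the per-URL download command.
          my @urls     = @ARGV;
          my $max_kids = 4;
          my %kids;

          for my $url (@urls) {
              if (keys %kids >= $max_kids) {
                  delete $kids{ wait() };    # block until one child finishes
              }
              my $pid = fork;
              die "fork failed: $!" unless defined $pid;
              if ($pid == 0) {
                  exec 'fetch_one', $url or die "exec failed: $!";  # child
              }
              $kids{$pid} = $url;
          }
          1 while wait() != -1;              # reap the remaining children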

      Now, which of the two options seems more worth pursuing?

      Makeshifts last the longest.

        You are missing the point. ...

        Unfortunately, they kind of pale in comparison to the 2-hour runtime the script currently takes…

        Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly.

        Actually, I would interpret my results to mean that a noticeable amount of excess time may be taken up by the OS doing process management. If I understood jcoxen's problem correctly, the shell script version, which is taking 2 hours, is running many short/simple processes on each file in a set of thousands of files. That's a lot of processes, even if they are just "mv" and "cp" and "sed" and other basic, low-footprint utilities. When you do many thousands of these simple little processes in rapid succession, you can really start to notice how heavy a load process management can be when it's pushed to the limit.

        I'm suggesting that the sheer quantity of processes being run by the OP's shell script is a major factor in the total time it takes -- it's likely that this is the point. (I'm assuming jcoxen had some evidence for deciding that most of the time was not being taken up by downloading the rar files, but rather in the subsequent shuffling/editing of thousands of data files.)

        My test involved a relatively small-scale comparison -- 3000 quick/simple processes vs. 1 perl process; extrapolating from that to a bigger task involving (let me guess) 300,000 quick/simple processes vs. 100 perl processes, I would expect the time savings to be proportional: still nearly 2 to 1, but on the scale of hours instead of seconds.

        I wasn't trying to present a specific solution for the given task, or to assert that perl will always be better/faster than a shell script -- I just wanted to highlight the impact of running way too many processes.

      rename is not equivalent to mv; however, it would appear that ExtUtils::Command's mv is close enough.
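
      For instance, File::Copy's move (made-up paths below) tries a rename first and falls back to copy-and-delete, so unlike the bare rename builtin it also works across filesystems:

          use strict;
          use warnings;
          use File::Copy qw(move);

          # move() renames when it can, otherwise copies and deletes,
          # which is the main behaviour that separates mv from rename.
          move('20040811/data.txt', '/other/volume/data.txt')
              or die "move failed: $!";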

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.