in reply to BASH vs Perl performance

Since Perl merely acts as a front end to mv and other command line UNIX commands...
Not exactly -- Perl calls the underlying rename system call directly, so there's some efficiency gained in not having to fork and exec. Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time.
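
To make that concrete, here is a minimal sketch of the two approaches in perl (the file names are made up):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Made-up file names, just to illustrate the two approaches.
    my ($old, $new) = ('incoming/file.dat', 'processed/file.dat');

    # Built-in: a single rename(2) system call, no fork/exec.
    # (Note that the builtin cannot move a file across filesystems.)
    rename $old, $new or die "rename failed: $!";

    # Shelling out: a separate mv process is forked and exec'd per file.
    system('mv', $new, $old) == 0 or die "mv exited with status $?";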

If you have time on your hands and are good at Perl, the only way to be sure is to write the program and measure. But I'm guessing it won't make a big difference.

Re^2: BASH vs Perl performance
by graff (Chancellor) on Aug 11, 2004 at 02:59 UTC
    Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time.

    I think that would be worth testing. When you're talking about hundreds of thousands of 'mv $a $b' in a shell vs. the equivalent number of 'rename $a, $b' in a perl script, the time (and overall cpu resources) saved by the latter could be well worth the time it takes to write the perl script -- especially if the process is going to be repeated at regular intervals.

    I have a handy "shell-loop" tool written in perl (posted here: shloop -- execute shell command on a list) which makes it easy to test this, using the standard "time" utility. I happened to have a set of 23 directories ("20*") holding a total of 3782 files, so I created a second set of 23 empty directories ("to*"), and tried the following:

    # rename files from 20* -> to*:
    $ find 20* -type f | time shloop -e rename -s ^20:to
           10.62 real         0.39 user         0.27 sys
    # now "mv" them back:
    $ find to* -type f | time shloop -e mv -s ^to:20
           18.99 real         0.96 user         6.93 sys
    This is on a standard intel desktop box running FreeBSD; I expect the results would be comparable (or more dramatic) on other unix flavors.

    The first case uses the perl-internal "rename" function call to relocate each of the 3782 files to a new directory; in the latter case, shloop opens a shell process ( open(SH, "|/bin/sh")) and prints 3782 successive "mv" commands to that shell. An interesting point to make here is that the first case also had the extra overhead of "growing" the new directories as files were added to them for the first time, whereas the target directories for the "mv" run were already big enough -- but the "mv" run still took almost twice as long (probably because of the overhead involved in creating and destroying all those sub-processes).
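
    In outline, the two code paths look something like this (a simplified sketch, not shloop's actual source -- file names arrive on stdin from the find commands above, and the target name comes from the same ^20 -> to substitution):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Simplified sketch of the two code paths compared above.
        my $mode = shift || 'rename';

        if ($mode eq 'rename') {
            # One perl process does everything via the rename(2) syscall.
            while (my $from = <STDIN>) {
                chomp $from;
                (my $to = $from) =~ s/^20/to/;
                rename $from, $to or warn "rename $from: $!";
            }
        }
        else {
            # Still one perl process, but every line costs the shell child
            # a fork/exec of a separate mv process.
            open(SH, "|/bin/sh") or die "can't start shell: $!";
            while (my $from = <STDIN>) {
                chomp $from;
                (my $to = $from) =~ s/^20/to/;
                print SH "mv '$from' '$to'\n";
            }
            close SH;
        }

    Either way only one perl process runs; the difference is the 3782 mv processes the shell has to spawn in the second branch.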

    This is a bit of a "nonsense" example, of course. Presumably, a shell command like "mv foo/* bar/" (or 23 of them, to handle the above example) would be really quick, because lots of files are moved in a single (compiled) process. But I wrote shloop to handle cases where each individual file needed a name-change as well (e.g. rename "foo/*.bar" to "fub/*.bub"). For this sort of case, a pure shell loop approach has to do something like  o=`echo $i | sed 's/foo\(.*\)bar/fub\1bub/'`; mv $i $o  on every iteration, which spawns several extra processes per file and would take much longer than the "mv" example shown above.
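
    For that per-file case, the whole loop can stay inside one perl process (a minimal sketch using the same made-up foo/bar/fub/bub names):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Rename foo/*.bar to fub/*.bub without spawning echo, sed or mv
        # for each file (names are the made-up ones from the example above).
        for my $old (glob 'foo/*.bar') {
            (my $new = $old) =~ s{^foo/(.*)\.bar$}{fub/$1.bub};
            rename $old, $new or warn "can't rename $old: $!";
        }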

    So the moral is: don't underestimate the load imposed by standard shell utilities -- they don't actually scale all that well when invoked in large quantities.

      You are missing the point.

      So you saved 6.5 seconds. Let's assume that as the actual script is more complex and the number of files is larger, you'd manage to shave 30x as much off of its runtime. That's 195 seconds, a little over three minutes. Your code probably does that job 60x or maybe 100x faster than the original script.

      These numbers, by any standard, are impressive.

      Unfortunately, they kind of pale in comparison to the 2-hour runtime the script currently takes…

      Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly.

      But if you can arrange for four parallel downloads (and one doesn't have to go to Perl for that -- job control is almost the shell's raison d'être), even considering all the other work the script has to do, the runtime would drop to something over half an hour. Maybe 45 minutes.
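
      Just to show the shape of it in perl terms (a rough sketch -- "fetch_one" is a stand-in for whatever actually grabs a single rar file, and as said above, backgrounding jobs and wait in the shell gets you the same effect):

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Rough sketch: keep up to four downloads running at once.
          # "fetch_one" is a made-up name for the per-URL download command.
          my @urls     = @ARGV;
          my $max_kids = 4;
          my %kids;

          for my $url (@urls) {
              if (keys %kids >= $max_kids) {
                  delete $kids{ wait() };    # block until one child finishes
              }
              my $pid = fork;
              die "fork failed: $!" unless defined $pid;
              if ($pid == 0) {
                  exec 'fetch_one', $url or die "exec failed: $!";  # child
              }
              $kids{$pid} = $url;
          }
          1 while wait() != -1;              # reap the remaining children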

      Now, which of the two options seems more worth pursuing?

      Makeshifts last the longest.

        You are missing the point. ...

        Unfortunately, they kind of pale in comparison to the 2-hour runtime the script currently takes…

        Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly.

        Actually, I would interpret my results to mean that a noticeable amount of excess time may be taken up by the OS doing process management. If I understood jcoxen's problem correctly, the shell script version, which is taking 2 hours, is running many short/simple processes on each file in a set of thousands of files. That's a lot of processes, even if they are just "mv" and "cp" and "sed" and other basic, low-footprint utilities. When you do many thousands of these simple little processes in rapid succession, you can really start to notice how heavy a load process management can be when it's pushed to the limit.

        I'm suggesting that the sheer quantity of processes being run by the OP's shell script is a major factor in the total time it takes -- it's likely that this is the point. (I'm assuming jcoxen had some evidence for deciding that most of the time was not being taken up by downloading the rar files, but rather in the subsequent shuffling/editing of thousands of data files.)

        My test involved a relatively small-scale comparison -- 3000 quick/simple processes vs. 1 perl process; extrapolating from that to a bigger task involving (let me guess) 300,000 quick/simple processes vs. 100 perl processes, I would expect the time savings to be proportional: still nearly 2 to 1, but on the scale of hours instead of seconds.

        I wasn't trying to present a specific solution for the given task, or to assert that perl will always be better/faster than a shell script -- I just wanted to highlight the impact of running way too many processes.

      rename is not equivalent to mv; however, it would appear that ExtUtils::Command's mv is close enough.
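
      For instance, File::Copy's move (made-up paths below) tries a rename first and falls back to copy-and-delete, so unlike the bare rename builtin it also works across filesystems:

          use strict;
          use warnings;
          use File::Copy qw(move);

          # move() renames when it can, otherwise copies and deletes,
          # which is the main behaviour that separates mv from rename.
          move('20040811/data.txt', '/other/volume/data.txt')
              or die "move failed: $!";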

      MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
      I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
      ** The third rule of perl club is a statement of fact: pod is sexy.