jcoxen has asked for the wisdom of the Perl Monks concerning the following question:
I have a bash script that does the following:
1-wget approx. 5000 rar files total from 2 different servers
2-unrar the archives into approx. 16000 files of 4 different types
3-move/rename the files into different directories based on file type
4-run one group of files through a series of sed filters producing a modified set of files
5-copy all files (now approx. 20000) into a matching directory structure on a Windows box via smbclient
This currently takes about 2 hours to run on a reasonably tricked out Sun E-250. Most of the time is taken up in step 3 - simultaneously moving and renaming the files.
My question is this: since Perl merely acts as a front end to mv and other command line UNIX commands, would I realize a significant performance increase by rewriting this script in Perl? The script works fine as is. I'm not interested in adding functionality. I'm concerned only with execution speed.
Thanks,
Jack
Re: BASH vs Perl performance
by VSarkiss (Monsignor) on Aug 10, 2004 at 20:07 UTC
Since Perl merely acts as a front end to mv and other command line UNIX commands...

Not exactly -- Perl calls the underlying rename system call directly, so there's some efficiency gained in not having to fork and exec. Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time. If you have time on your hands and are good at Perl, the only way to be sure is to write the program and measure. But I'm guessing it won't make a big difference.
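A minimal sketch of the distinction being drawn here, with hypothetical file names (nothing here is taken from the original script):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Perl's built-in rename is a single system call; no /bin/mv is forked.
    # (It only works within one filesystem; File::Copy's move() falls back
    # to a copy-and-delete when that matters.)
    rename 'incoming/file0001.dat', 'renamed/file0001.dat'
        or warn "rename failed: $!\n";

    # Shelling out pays a fork and exec on every file -- the overhead the
    # shell-script version incurs thousands of times.
    system('mv', 'incoming/file0002.dat', 'renamed/file0002.dat') == 0
        or warn "mv failed\n";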
by graff (Chancellor) on Aug 11, 2004 at 02:59 UTC
I think that would be worth testing. When you're talking about hundreds of thousands of 'mv $a $b' in a shell vs. the equivalent number of 'rename $a, $b' in a perl script, the time (and overall cpu resources) saved by the latter could be well worth the time it takes to write the perl script -- especially if the process is going to be repeated at regular intervals.

I have a handy "shell-loop" tool written in perl (posted here: shloop -- execute shell command on a list) which makes it easy to test this, using the standard "time" utility. I happened to have a set of 23 directories ("20*") holding a total of 3782 files, so I created a second set of 23 empty directories ("to*"), and tried the following:

This is on a standard intel desktop box running FreeBSD; I expect the results would be comparable (or more dramatic) on other unix flavors. The first case uses the perl-internal "rename" function call to relocate each of the 3782 files to a new directory; in the latter case, shloop opens a shell process (open(SH, "|/bin/sh")) and prints 3782 successive "mv" commands to that shell.

An interesting point to make here is that the first case also had the extra overhead of "growing" the new directories as files were added to them for the first time, whereas the target directories for the "mv" run were already big enough -- but the "mv" run still took almost twice as long (probably because of the overhead involved in creating and destroying all those sub-processes).

This is a bit of a "nonsense" example, of course. Presumably, a shell command like "mv foo/* bar/" (or 23 of them, to handle the above example) would be really quick, because lots of files are moved in a single (compiled) process. But I wrote shloop to handle cases where each individual file needed a name-change as well (e.g. rename "foo/*.bar" to "fub/*.bub"). For this sort of case, a pure shell loop approach has to do something like

    o=`echo $i | sed 's/foo\(.*\)bar/fub\1bub/'`; mv $i $o

on every iteration, which would take much longer than the "mv" example shown above. So the moral is: don't underestimate the load imposed by standard shell utilities -- they don't actually scale all that well when invoked in large quantities.
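For the rename-with-pattern case, the whole loop can run inside a single perl process. A rough sketch using the hypothetical foo/*.bar to fub/*.bub pattern from above (it assumes the fub directory already exists):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rename foo/*.bar to fub/*.bub without spawning an echo, a sed and
    # an mv for every single file.
    for my $old (glob 'foo/*.bar') {
        (my $new = $old) =~ s{^foo/(.*)\.bar$}{fub/$1.bub};
        rename $old, $new or warn "rename $old -> $new failed: $!\n";
    }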
by Aristotle (Chancellor) on Aug 11, 2004 at 03:18 UTC
You are missing the point. So you saved 6.5 seconds. Let's assume that as the actual script is more complex and the number of files is larger, you'd manage to shave 30x as much off of its runtime. That's 195 seconds, a little over three minutes. Your code probably does that job 60x or maybe 100x faster than the original script. These numbers, by any standard, are impressive. Unfortunately, they kind of pale in comparison to the 2 hours runtime the script currently takes…

Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly. But if you can arrange for four parallel downloads (and one doesn't have to go Perl for that — job control is almost the shell's raison d'être), even considering all the other work the script has to do, runtime would drop to something over half an hour. Maybe 45 minutes.

Now, which of the two options seems more worth pursuing?

Makeshifts last the longest.
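In Perl terms, the parallel-download arrangement might look like the sketch below; the URL list and the four-way split are invented, and as noted above, plain shell job control with & and wait achieves the same effect:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @urls    = map { "http://example.com/archives/part$_.rar" } 1 .. 5000;  # hypothetical
    my $workers = 4;

    my @kids;
    for my $w (0 .. $workers - 1) {
        # Give each worker every $workers-th URL. (In practice you would
        # hand wget an -i list file rather than thousands of arguments.)
        my @mine = @urls[grep { $_ % $workers == $w } 0 .. $#urls];
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {
            exec 'wget', '-q', @mine;    # child: one wget fetching its share
            die "exec wget failed: $!";
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;             # wait for all the downloads to finish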
by graff (Chancellor) on Aug 11, 2004 at 04:19 UTC
by PodMaster (Abbot) on Aug 11, 2004 at 03:09 UTC
Re: BASH vs Perl performance
by tilly (Archbishop) on Aug 11, 2004 at 00:44 UTC
However your tasks all look heavily I/O bound. I/O tends to lend itself well to parallelization. It would take a lot more work, but if you made good use of something like Parallel::ForkManager to parallelize the work, you could get big wins. Suppose that you found that you could run 4 processes at once without them interfering with each other. If you rewrote the whole thing to take advantage of that, then your 2 hour job drops to 30 minutes!

You'll have to benchmark to find where you hit the point of diminishing returns from parallelizing, but I'd consider only being able to benefit from 4 processes at once to be a disappointing gain. But before you start having visions of being able to run 8 or 16 processes at once, note that you undoubtedly spend at least a little bit of time doing non-parallelizable work. Time spent with, for instance, a remote connection saturated on bandwidth is not going to go away when you parallelize.

So it will take more work than you were planning on, but a rewrite should be able to achieve significant performance gains. But only if you look for the performance gains in a different place than you were looking.

UPDATE: graff's benchmark at Re^3: BASH vs Perl performance suggests that the overhead for launching a process is much higher than I'd have thought. On the order of 0.035 seconds per process on his laptop. If that holds true on the hardware that you're running, cutting out all those process launches could be worth a lot more performance than I would have thought.
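A minimal Parallel::ForkManager sketch of that idea, applied to the unrar step; the directory names, the unrar switches and the 4-process limit are placeholders, not anything from the original script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);   # benchmark to find the sweet spot

    # Extract each archive in a child process, at most 4 at a time.
    for my $rar (glob 'incoming/*.rar') {
        $pm->start and next;                  # parent: move on to the next archive
        system('unrar', 'x', '-inul', $rar, 'extracted/') == 0
            or warn "unrar failed for $rar\n";
        $pm->finish;                          # child exits here
    }
    $pm->wait_all_children;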
by jcoxen (Deacon) on Aug 11, 2004 at 16:20 UTC
Thanks to everyone for the thoughts and comments.

Jack
Re: BASH vs Perl performance
by Joost (Canon) on Aug 10, 2004 at 20:09 UTC
I've noticed Solaris machines getting quite slow with file management if you dump more than a couple of thousand files in the same directory, so if that's what your code is doing, try downloading and unrarring in different directories.
by MidLifeXis (Monsignor) on Aug 10, 2004 at 21:05 UTC
The filesystem I was most familiar with when writing 295978 was Solaris. See that node for an explanation of how to hash directories and why you might want to.

--MidLifeXis
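The directory-hashing idea looks roughly like this; the two-character bucket scheme below is just one hypothetical way to do it (see the node MidLifeXis mentions for the full discussion):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Basename qw(basename);
    use File::Path qw(mkpath);

    # Spread files across subdirectories keyed on the first two characters
    # of the name, so no single directory grows to thousands of entries.
    for my $file (glob 'extracted/*') {
        my $name   = basename($file);
        my $bucket = lc substr($name, 0, 2);
        mkpath("hashed/$bucket") unless -d "hashed/$bucket";
        rename $file, "hashed/$bucket/$name"
            or warn "could not move $file: $!\n";
    }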
Re: BASH vs Perl performance
by jfroebe (Parson) on Aug 10, 2004 at 20:02 UTC
Hi Jack,

Not really... most of the time will be spent in the wget, unrar and copying of the data files. Perhaps the only thing perl *may* improve on is the sed filters... it depends though. Unless your bash script is broken, don't bother migrating it to perl - except maybe as an exercise.

Jason

No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
Re: BASH vs Perl performance
by ysth (Canon) on Aug 10, 2004 at 20:18 UTC
Keep in mind that there's no reason not to mix bash and perl for different parts. Might be helpful for you to show that part of your bash script.
by jcoxen (Deacon) on Aug 10, 2004 at 21:01 UTC
by runrig (Abbot) on Aug 10, 2004 at 21:30 UTC
I would change to this:

Often the 'echo ... | sed ...' lines can be replaced with shell parameter expansion, which can speed up some scripts. Overall though, I don't know if it'll do much for you. Your big sed pipe could be put into one sed command, and your deletion of multiple dashes looks wrong (especially since you do it twice). Do you want this?: s/--*/-/g

And as merlyn might point out, you have a few useless uses of cat. You could use either input redirection or specify the file on the first command. In keeping with your current style, this works in ksh, I don't know about bash:

There are cases when sed is faster than perl, and the other way around. Last time I compared, it seemed that when I used a lot of character classes (e.g., [0-9], etc.) perl tended to be faster.

Update: or maybe when I could replace things like [0-9] in sed with \d in perl, and needed case insensitive matches, which you can't do with the old standard sed.
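As an illustration of collapsing a whole chain of sed filters into a single pass, here is a sketch; the substitutions are invented, since the real filters aren't reproduced here:

    #!/usr/bin/perl
    use strict;
    use warnings;

    $^I = '.bak';         # edit files named on the command line in place, keeping backups

    while (<>) {
        s/\r$//;          # example: strip DOS line endings
        s/-+/-/g;         # example: collapse runs of dashes (the s/--*/-/g above)
        s/\bFOO\b/BAR/g;  # example: placeholder substitution
        print;
    }

Run it as, say, perl fixup.pl group1/*.txt (the script name is made up) and every file is rewritten by one perl process instead of being piped through several sed invocations apiece.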
Re: BASH vs Perl performance
by waswas-fng (Curate) on Aug 10, 2004 at 22:28 UTC
1-wget approx. 5000 rar files total from 2 different servers

Depending on what you are doing here with wget (FTP, HTTP, HTTPS), you can use LWP or Net::FTP to avoid running 5000 instances (exec's) of wget.

2-unrar the archives into approx. 16000 files of 4 different types

This can't really be made faster with perl.

3-move/rename the files into different directories based on file type

Avoid this step altogether (do the final placement while doing step 5).

4-run one group of files through a series of sed filters producing a modified set of files

You can run this through your script without invoking sed so many times. Perl has built-in functionality that can do this task in one process.

5-copy all files (now approx. 20000) into a matching directory structure on a Windows box via smbclient

Copy the files (renamed, into their new directory targets) to the structure that you need on the Windows server at this time. There's no need to move stuff around in step 3 at all.

-Waswas
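A rough sketch of skipping the intermediate move and renaming files straight into the tree that gets pushed out via smbclient; the extension-to-directory mapping here is invented, and the real one would come from the existing script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Copy qw(move);
    use File::Path qw(mkpath);

    # Hypothetical mapping from file extension to final directory.
    my %dest_for = (
        txt => 'staging/reports',
        csv => 'staging/data',
        log => 'staging/logs',
        xml => 'staging/config',
    );

    for my $file (glob 'extracted/*') {
        my ($name, $ext) = $file =~ m{([^/]+)\.(\w+)$} or next;
        my $dir = $dest_for{lc $ext} or next;      # skip unknown types
        mkpath($dir) unless -d $dir;
        move($file, "$dir/$name.$ext")             # rename straight into place
            or warn "could not move $file: $!\n";
    }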
by jcoxen (Deacon) on Aug 11, 2004 at 16:10 UTC
1. It's pointed out elsewhere but I'm only doing 2 wgets, not 5000.

It was step 3 (and to a lesser extent, step 4) that prompted my question about porting to Perl. After reading everyone's responses, I think I'll go ahead and port it over. Worst case is I learn some new stuff. Best case is the process runs a whole lot faster.

Thanks for your comments,

Jack
Re: BASH vs Perl performance
by graff (Chancellor) on Aug 11, 2004 at 03:31 UTC
I believe it's very likely that "a series of sed filters", applied iteratively to thousands of files to alter their contents, would be slower than a single perl script that applies all the filtering over the full list of files in a single process. And the "File::Copy" module might compare quite favorably to "cp" commands in a shell script -- again, depending on how complicated the process is. (On the other hand, a single "rsync" job might be best for this last step.)

When manipulating files by the thousands, it really makes a difference when you can run just a few distinct processes to do it all, rather than thousands of distinct processes. Also, whenever you can do anything to reduce the total number of "intermediate" files created and destroyed in the overall procedure (e.g. keeping whole archive sets in memory and/or doing in-place edits), you will find this to be worthwhile.
by Aristotle (Chancellor) on Aug 11, 2004 at 03:47 UTC
Did you look at the actual script? He is using a grand total of 2 wget processes. Hardly a reason to switch to LWP.

I strongly doubt that using Archive::Rar, which has to mediate between C and Perl data structures, is going to be a win over using an external binary for a simple uncompression.

He can save mv processes by using xargs. Most of his sed filters can be condensed.

Granted, a mediocrely written shell script is going to be much slower than a mediocrely written Perl script, but for the tasks it's doing, shell seems like a more than decent tool.

Makeshifts last the longest.
by graff (Chancellor) on Aug 11, 2004 at 04:43 UTC
True enough. (I had glossed over that part of the script.)

I strongly doubt that using Archive::Rar... is going to be a win

Agreed.
He can save mv processes by using xargs.
Most of his sed filters can be condensed.
This is where I'm doubtful -- maybe xargs can support something like what the OP's script is doing, but frankly I think a simple perl script could do it more cogently, and could replace all the sed filtering as well; again, one perl process working on a list of thousands of files will be a win over large numbers of mv and sed jobs, even if xargs is helping. Unless a person is really expert at shell, sed, xargs, etc., while being really new to Perl, I'd think using Perl here would be fruitful and worth the time spent. And on taking the first step, it may be worthwhile to consider how Perl scripting could provide other optimizations that might be hard to achieve in shell scripting.

Update: I seem to be contradicting tilly's estimates about the overall impact of process management. I'll stick by that, based partly on the evidence in my "rename vs. mv" test, and on other experience I've had (on Solaris, as it happens), where I altered a perl script from doing something like this:

to doing something like this, which produces the same result:

The difference was dramatic. In that case, a lot of the overhead was presumably due to starting lots of shell processes, each one running just one cksum, which probably makes it an "unfair" comparison. Still, it was dramatic.
I decided to retest on my macosx laptop, in a directory that includes lots of software distributions: nearly 12,000 data files, and lots of these are very small -- but not all of them; total space consumed is

The version with xargs took 7 minutes 5 sec; the version with 12,000 cksum processes run within a single shell (doing slightly more work in perl, but not trying to store the results anywhere) took 14 minutes 13 sec. I'd have to attribute most of the difference to process management issues, and I think there's something missing in tilly's estimates.
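For reference, the two approaches being compared look roughly like the sketch below; this is not graff's actual code, and the file list, temp file and output handling are all hypothetical (filenames containing whitespace would need more care in both versions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = glob 'distribs/*';   # hypothetical file list

    # Slow way: one shell plus one cksum process per file.
    my %slow;
    for my $f (@files) {
        my ($sum) = split ' ', `cksum "$f"`;
        $slow{$f} = $sum;
    }

    # Faster way: hand the whole list to xargs, which batches the names
    # into a handful of cksum invocations.
    open my $list, '>', 'names.tmp' or die "cannot write names.tmp: $!";
    print {$list} "$_\n" for @files;
    close $list;

    my %fast;
    open my $ck, '-|', 'xargs cksum < names.tmp' or die "cannot run xargs: $!";
    while (<$ck>) {
        chomp;
        my ($sum, undef, $name) = split ' ', $_, 3;
        $fast{$name} = $sum;
    }
    close $ck;
    unlink 'names.tmp';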
by tilly (Archbishop) on Aug 11, 2004 at 14:25 UTC