jcoxen has asked for the wisdom of the Perl Monks concerning the following question:
I have a bash script that does the following:
1-wget approx. 5000 rar files total from 2 different servers
2-unrar the archives into approx. 16000 files of 4 different types
3-move/rename the files into different directories based on file type
4-run one group of files through a series of sed filters producing a modified set of files
5-copy all files (now approx. 20000) into a matching directory structure on a Windows box via smbclient
This currently takes about 2 hours to run on a reasonably tricked out Sun E-250. Most of the time is taken up in step 3 - simultaneously moving and renaming the files.
My question is this: since Perl merely acts as a front end to mv and other command line UNIX commands, would I realize a significant performance increase by rewriting this script in Perl? The script works fine as is. I'm not interested in adding functionality. I'm concerned only with execution speed.
Thanks,
Jack
Re: BASH vs Perl performance
by VSarkiss (Monsignor) on Aug 10, 2004 at 20:07 UTC
Since Perl merely acts as a front end to mv and other command line UNIX commands...

Not exactly -- Perl calls the underlying rename system call directly, so there's some efficiency gained in not having to fork and exec. Nonetheless, it's unlikely that reducing that overhead is going to make a significant dent in your execution time. If you have time on your hands and are good at Perl, the only way to be sure is to write the program and measure. But I'm guessing it won't make a big difference.
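A minimal sketch of the distinction being drawn here, with hypothetical file names (nothing here is taken from the original script):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Perl's built-in rename is a single system call; no /bin/mv is forked.
    # (It only works within one filesystem; File::Copy's move() falls back
    # to a copy-and-delete when that matters.)
    rename 'incoming/file0001.dat', 'renamed/file0001.dat'
        or warn "rename failed: $!\n";

    # Shelling out pays a fork and exec on every file -- the overhead the
    # shell-script version incurs thousands of times.
    system('mv', 'incoming/file0002.dat', 'renamed/file0002.dat') == 0
        or warn "mv failed\n";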
by graff (Chancellor) on Aug 11, 2004 at 02:59 UTC
I think that would be worth testing. When you're talking about hundreds of thousands of 'mv $a $b' in a shell vs. the equivalent number of 'rename $a, $b' in a perl script, the time (and overall cpu resources) saved by the latter could be well worth the time it takes to write the perl script -- especially if the process is going to be repeated at regular intervals.

I have a handy "shell-loop" tool written in perl (posted here: shloop -- execute shell command on a list) which makes it easy to test this, using the standard "time" utility. I happened to have a set of 23 directories ("20*") holding a total of 3782 files, so I created a second set of 23 empty directories ("to*"), and tried the following:

This is on a standard intel desktop box running FreeBSD; I expect the results would be comparable (or more dramatic) on other unix flavors. The first case uses the perl-internal "rename" function call to relocate each of the 3782 files to a new directory; in the latter case, shloop opens a shell process (open(SH, "|/bin/sh")) and prints 3782 successive "mv" commands to that shell.

An interesting point to make here is that the first case also had the extra overhead of "growing" the new directories as files were added to them for the first time, whereas the target directories for the "mv" run were already big enough -- but the "mv" run still took almost twice as long (probably because of the overhead involved in creating and destroying all those sub-processes).

This is a bit of a "nonsense" example, of course. Presumably, a shell command like "mv foo/* bar/" (or 23 of them, to handle the above example) would be really quick, because lots of files are moved in a single (compiled) process. But I wrote shloop to handle cases where each individual file needed a name-change as well (e.g. rename "foo/*.bar" to "fub/*.bub"). For this sort of case, a pure shell loop approach has to do something like

    o=`echo $i | sed 's/foo\(.*\)bar/fub\1bub/'`; mv $i $o

on every iteration, which would take much longer than the "mv" example shown above. So the moral is: don't underestimate the load imposed by standard shell utilities -- they don't actually scale all that well when invoked in large quantities.
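For the rename-with-pattern case, the whole loop can run inside a single perl process. A rough sketch using the hypothetical foo/*.bar to fub/*.bub pattern from above (it assumes the fub directory already exists):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Rename foo/*.bar to fub/*.bub without spawning an echo, a sed and
    # an mv for every single file.
    for my $old (glob 'foo/*.bar') {
        (my $new = $old) =~ s{^foo/(.*)\.bar$}{fub/$1.bub};
        rename $old, $new or warn "rename $old -> $new failed: $!\n";
    }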
by Aristotle (Chancellor) on Aug 11, 2004 at 03:18 UTC
You are missing the point. So you saved 6.5 seconds. Let's assume that as the actual script is more complex and the number of files is larger, you'd manage to shave 30x as much off of its runtime. That's 195 seconds, a little over three minutes. Your code probably does that job 60x or maybe 100x faster than the original script. These numbers, by any standard, are impressive. Unfortunately, they kind of pale in comparison to the 2 hours runtime the script currently takes…

Is it worth going to any lengths to take 3 minutes off the runtime of a 2-hour job? Hardly. But if you can arrange for four parallel downloads (and one doesn't have to go Perl for that — job control is almost the shell's raison d'être), even considering all the other work the script has to do, runtime would drop to something over half an hour. Maybe 45 minutes.

Now, which of the two options seems more worth pursuing?

Makeshifts last the longest.
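In Perl terms, the parallel-download arrangement might look like the sketch below; the URL list and the four-way split are invented, and as noted above, plain shell job control with & and wait achieves the same effect:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @urls    = map { "http://example.com/archives/part$_.rar" } 1 .. 5000;  # hypothetical
    my $workers = 4;

    my @kids;
    for my $w (0 .. $workers - 1) {
        # Give each worker every $workers-th URL. (In practice you would
        # hand wget an -i list file rather than thousands of arguments.)
        my @mine = @urls[grep { $_ % $workers == $w } 0 .. $#urls];
        defined(my $pid = fork) or die "fork failed: $!";
        if ($pid == 0) {
            exec 'wget', '-q', @mine;    # child: one wget fetching its share
            die "exec wget failed: $!";
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;             # wait for all the downloads to finish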
by graff (Chancellor) on Aug 11, 2004 at 04:19 UTC
by PodMaster (Abbot) on Aug 11, 2004 at 03:09 UTC
Re: BASH vs Perl performance
by tilly (Archbishop) on Aug 11, 2004 at 00:44 UTC
However your tasks all look heavily I/O bound. I/O tends to lend itself well to parallelization. It would take a lot more work, but if you made good use of something like Parallel::ForkManager to parallelize the work, you could get big wins. Suppose that you found that you could run 4 processes at once without them interfering with each other. If you rewrote the whole thing to take advantage of that, then your 2 hour job drops to 30 minutes!

You'll have to benchmark to find where you hit the point of diminishing returns from parallelizing, but I'd consider only being able to benefit from 4 processes at once to be a disappointing gain. But before you start having visions of being able to run 8 or 16 processes at once, note that you undoubtedly spend at least a little bit of time doing non-parallelizable work. Time spent with, for instance, a remote connection saturated on bandwidth is not going to go away when you parallelize.

So it will take more work than you were planning on, but a rewrite should be able to achieve significant performance gains. But only if you look for the performance gains in a different place than you were looking.

UPDATE: graff's benchmark at Re^3: BASH vs Perl performance suggests that the overhead for launching a process is much higher than I'd have thought. On the order of 0.035 seconds per process on his laptop. If that holds true on the hardware that you're running, cutting out all those process launches could be worth a lot more performance than I would have thought.
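A minimal Parallel::ForkManager sketch of that idea, applied to the unrar step; the directory names, the unrar switches and the 4-process limit are placeholders, not anything from the original script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);   # benchmark to find the sweet spot

    # Extract each archive in a child process, at most 4 at a time.
    for my $rar (glob 'incoming/*.rar') {
        $pm->start and next;                  # parent: move on to the next archive
        system('unrar', 'x', '-inul', $rar, 'extracted/') == 0
            or warn "unrar failed for $rar\n";
        $pm->finish;                          # child exits here
    }
    $pm->wait_all_children;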
by jcoxen (Deacon) on Aug 11, 2004 at 16:20 UTC
Thanks to everyone for the thoughts and comments.

Jack
Re: BASH vs Perl performance
by Joost (Canon) on Aug 10, 2004 at 20:09 UTC
I've noticed Solaris machines getting quite slow with file management if you dump more than a couple of thousand files in the same directory, so if that's what your code is doing, try downloading and unrarring in different directories.
by MidLifeXis (Monsignor) on Aug 10, 2004 at 21:05 UTC
The filesystem I was most familiar with when writing 295978 was Solaris. See that node for an explanation of how to hash directories and why you might want to.

--MidLifeXis
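The directory-hashing idea looks roughly like this; the two-character bucket scheme below is just one hypothetical way to do it (see the node MidLifeXis mentions for the full discussion):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Basename qw(basename);
    use File::Path qw(mkpath);

    # Spread files across subdirectories keyed on the first two characters
    # of the name, so no single directory grows to thousands of entries.
    for my $file (glob 'extracted/*') {
        my $name   = basename($file);
        my $bucket = lc substr($name, 0, 2);
        mkpath("hashed/$bucket") unless -d "hashed/$bucket";
        rename $file, "hashed/$bucket/$name"
            or warn "could not move $file: $!\n";
    }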
Re: BASH vs Perl performance
by jfroebe (Parson) on Aug 10, 2004 at 20:02 UTC
Hi Jack,

Not really... most of the time will be spent in the wget, unrar and copying of the data files. Perhaps the only thing perl *may* improve on is the sed filters... it depends though. Unless your bash script is broken, don't bother migrating it to perl - except maybe as an exercise.

Jason

No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
Re: BASH vs Perl performance
by ysth (Canon) on Aug 10, 2004 at 20:18 UTC
Keep in mind that there's no reason not to mix bash and perl for different parts. Might be helpful for you to show that part of your bash script.
by jcoxen (Deacon) on Aug 10, 2004 at 21:01 UTC
by runrig (Abbot) on Aug 10, 2004 at 21:30 UTC
I would change to this:

Often the 'echo ... | sed ...' lines can be replaced with shell parameter expansion, which can speed up some scripts. Overall though, I don't know if it'll do much for you. Your big sed pipe could be put into one sed command, and your deletion of multiple dashes looks wrong (especially since you do it twice). Do you want this?: s/--*/-/g

And as merlyn might point out, you have a few useless uses of cat. You could use either input redirection or specify the file on the first command. In keeping with your current style, this works in ksh, I don't know about bash:

There are cases when sed is faster than perl, and the other way around. Last time I compared, it seemed that when I used a lot of character classes (e.g., [0-9], etc.) perl tended to be faster.

Update: or maybe when I could replace things like [0-9] in sed with \d in perl, and needed case insensitive matches, which you can't do with the old standard sed.
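As an illustration of collapsing a whole chain of sed filters into a single pass, here is a sketch; the substitutions are invented, since the real filters aren't reproduced here:

    #!/usr/bin/perl
    use strict;
    use warnings;

    $^I = '.bak';         # edit files named on the command line in place, keeping backups

    while (<>) {
        s/\r$//;          # example: strip DOS line endings
        s/-+/-/g;         # example: collapse runs of dashes (the s/--*/-/g above)
        s/\bFOO\b/BAR/g;  # example: placeholder substitution
        print;
    }

Run it as, say, perl fixup.pl group1/*.txt (the script name is made up) and every file is rewritten by one perl process instead of being piped through several sed invocations apiece.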
Re: BASH vs Perl performance
by waswas-fng (Curate) on Aug 10, 2004 at 22:28 UTC
1-wget approx. 5000 rar files total from 2 different servers

Depending on what you are doing here with wget (FTP, HTTP, HTTPS), you can use LWP or Net::FTP to avoid running 5000 instances (exec's) of wget.

2-unrar the archives into approx. 16000 files of 4 different types

This can't really be made faster with perl.

3-move/rename the files into different directories based on file type

Avoid this step altogether (do the final placement while doing step 5).

4-run one group of files through a series of sed filters producing a modified set of files

You can run this through your script without invoking sed so many times. Perl has built-in functionality that can do this task in one process.

5-copy all files (now approx. 20000) into a matching directory structure on a Windows box via smbclient

Copy the files (renamed, into their new directory targets) to the structure that you need on the Windows server at this time. There's no need to move stuff around in step 3 at all.

-Waswas
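A rough sketch of skipping the intermediate move and renaming files straight into the tree that gets pushed out via smbclient; the extension-to-directory mapping here is invented, and the real one would come from the existing script:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Copy qw(move);
    use File::Path qw(mkpath);

    # Hypothetical mapping from file extension to final directory.
    my %dest_for = (
        txt => 'staging/reports',
        csv => 'staging/data',
        log => 'staging/logs',
        xml => 'staging/config',
    );

    for my $file (glob 'extracted/*') {
        my ($name, $ext) = $file =~ m{([^/]+)\.(\w+)$} or next;
        my $dir = $dest_for{lc $ext} or next;      # skip unknown types
        mkpath($dir) unless -d $dir;
        move($file, "$dir/$name.$ext")             # rename straight into place
            or warn "could not move $file: $!\n";
    }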
by jcoxen (Deacon) on Aug 11, 2004 at 16:10 UTC
1. It's pointed out elsewhere but I'm only doing 2 wgets, not 5000.

It was step 3 (and to a lesser extent, step 4) that prompted my question about porting to Perl. After reading everyone's responses, I think I'll go ahead and port it over. Worst case is I learn some new stuff. Best case is the process runs a whole lot faster.

Thanks for your comments,

Jack
Re: BASH vs Perl performance
by graff (Chancellor) on Aug 11, 2004 at 03:31 UTC
I believe it's very likely that "a series of sed filters", applied iteratively to thousands of files to alter their contents, would be slower than a single perl script that applies all the filtering over the full list of files in a single process. And the "File::Copy" module might compare quite favorably to "cp" commands in a shell script -- again, depending on how complicated the process is. (On the other hand, a single "rsync" job might be best for this last step.)

When manipulating files by the thousands, it really makes a difference when you can run just a few distinct processes to do it all, rather than thousands of distinct processes. Also, whenever you can do anything to reduce the total number of "intermediate" files created and destroyed in the overall procedure (e.g. keeping whole archive sets in memory and/or doing in-place edits), you will find this to be worthwhile.
by Aristotle (Chancellor) on Aug 11, 2004 at 03:47 UTC
Did you look at the actual script? He is using a grand total of 2 wget processes. Hardly a reason to switch to LWP.

I strongly doubt that using Archive::Rar, which has to mediate between C and Perl data structures, is going to be a win over using an external binary for a simple uncompression.

He can save mv processes by using xargs. Most of his sed filters can be condensed.

Granted, a mediocrely written shell script is going to be much slower than a mediocrely written Perl script, but for the tasks it's doing, shell seems like a more than decent tool.

Makeshifts last the longest.
by graff (Chancellor) on Aug 11, 2004 at 04:43 UTC
True enough. (I had glossed over that part of the script.)

I strongly doubt that using Archive::Rar... is going to be a win

Agreed.
He can save mv processes by using xargs.
Most of his sed filters can be condensed.
This is where I'm doubtful -- maybe xargs can support something like what the OP's script is doing, but frankly I think a simple perl script could do it more cogently, and could replace all the sed filtering as well; again, one perl process working on a list of thousands of files will be a win over large numbers of mv and sed jobs, even if xargs is helping. Unless a person is really expert at shell, sed, xargs, etc., while being really new to Perl, I'd think using Perl here would be fruitful and worth the time spent. And on taking the first step, it may be worthwhile to consider how Perl scripting could provide other optimizations that might be hard to achieve in shell scripting.

Update: I seem to be contradicting tilly's estimates about the overall impact of process management. I'll stick by that, based partly on the evidence in my "rename vs. mv" test, and on other experience I've had (on Solaris, as it happens), where I altered a perl script from doing something like this:

to doing something like this, which produces the same result:

The difference was dramatic. In that case, a lot of the overhead was presumably due to starting lots of shell processes, each one running just one cksum, which probably makes it an "unfair" comparison. Still, it was dramatic.
I decided to retest on my macosx laptop, in a directory that includes lots of software distributions: nearly 12,000 data files, and lots of these are very small -- but not all of them; total space consumed is

The version with xargs took 7 minutes 5 sec; the version with 12,000 cksum processes run within a single shell (doing slightly more work in perl, but not trying to store the results anywhere) took 14 minutes 13 sec. I'd have to attribute most of the difference to process management issues, and I think there's something missing in tilly's estimates.
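For reference, the two approaches being compared look roughly like the sketch below; this is not graff's actual code, and the file list, temp file and output handling are all hypothetical (filenames containing whitespace would need more care in both versions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = glob 'distribs/*';   # hypothetical file list

    # Slow way: one shell plus one cksum process per file.
    my %slow;
    for my $f (@files) {
        my ($sum) = split ' ', `cksum "$f"`;
        $slow{$f} = $sum;
    }

    # Faster way: hand the whole list to xargs, which batches the names
    # into a handful of cksum invocations.
    open my $list, '>', 'names.tmp' or die "cannot write names.tmp: $!";
    print {$list} "$_\n" for @files;
    close $list;

    my %fast;
    open my $ck, '-|', 'xargs cksum < names.tmp' or die "cannot run xargs: $!";
    while (<$ck>) {
        chomp;
        my ($sum, undef, $name) = split ' ', $_, 3;
        $fast{$name} = $sum;
    }
    close $ck;
    unlink 'names.tmp';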
by tilly (Archbishop) on Aug 11, 2004 at 14:25 UTC