Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Fellow Monks,

I am on Windows 2000, using ActiveState Perl. I have the following piece of code. It simply grabs a load of text files from a directory and zips them up for archiving. It works well for a small number of text files, but I've got 3,500-plus text files, and it's taking up to an hour to complete. Any hints on how to speed this up? (The resulting zip file has to be compatible with WinZip.)
#!/usr/bin/perl
use warnings;
use strict;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

my $date = "text files";
my $zip  = Archive::Zip->new();

opendir DIR, $date;
my @Filelist = grep { /.txt$/ } readdir DIR;

foreach my $file (@Filelist) {
    my $member = $zip->addFile("$date\\$file");
    die 'Error writing file'
        if $zip->writeToFileNamed("$date.zip") != AZ_OK;
}
print "Finished\n";
Thanks in advance,
Jonathan.

Replies are listed 'Best First'.
Re: Archive::Zip performance question
by Albannach (Monsignor) on Feb 16, 2005 at 16:24 UTC
    Well, one thing that clearly will not scale is that you are writing the entire ZIP file after each new member is added to the archive, and that ZIP just gets bigger each time. Perhaps try it with the writeToFileNamed outside the loop?
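A sketch of that change, reusing the original variable names; the directory setup at the top is just demo scaffolding (an assumption for a self-contained example, not part of the fix):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
use File::Path qw( make_path );

# Demo scaffolding only: a folder of sample text files.
my $date = "text files";
make_path($date);
for my $i ( 1 .. 3 ) {
    open my $out, '>', "$date/file$i.txt" or die "open: $!";
    print $out "sample $i\n";
    close $out;
}

my $zip = Archive::Zip->new();

opendir my $dh, $date or die "Cannot open $date: $!";
my @Filelist = grep { /\.txt$/ } readdir $dh;
closedir $dh;

# Add all the members first; nothing is written to disk yet.
$zip->addFile( "$date/$_", $_ ) for @Filelist;

# Write the whole archive exactly once, outside the loop.
die 'Error writing file'
    if $zip->writeToFileNamed("$date.zip") != AZ_OK;

print "Finished\n";
```

(Forward slashes in paths work fine under Perl on Windows, so the sketch avoids escaping backslashes.)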

    Update: I just noticed you didn't escape the period in your grep, but I imagine you probably wanted to.

    --
    I'd like to be able to assign to an luser

      My programming teacher in school (ahh, long ago) told us this story once:

      One day Schlemihl got a new job at a company that paints the white markings on the street. On the first day Schlemihl tagged 3 miles of the street with markings. The boss was very pleased and thought about paying him a bonus. On the second day Schlemihl did 1.2 miles, and on the third day he only made 300 yards.
      What's up?, asked the boss. You did 3 miles on your first day and now you come back with a wimpy 300 yards. Don't you want to keep your job?

      Well, said Schlemihl, the way to the paint bucket gets longer and longer.
      Since then I have known the term Schlemihl Algorithm for an algorithm that does not scale well.


      holli, /regexed monk/
        Just for fun, I tried to reason out how far Schlemihl walked, with the following assumptions:
        Each stripe is 24 inches long, and separated from the next stripe by 24 inches
        The paint brush holds enough paint to paint 1 stripe
        There are 1311 stripes per block (1 block = 1 mile, with 36 foot streets separating the blocks that are unpainted)

        Where T = trip number, S = stripe length, I = interval between stripes
        That makes trip T = T*(2S) + 2I*(T-1),
        so Trip 1 = 48"
        Trip 2 = 144"
        Trip 3 = 240", etc.

        total trip = T1+T2+T3...
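For the curious, the per-trip figures can be checked mechanically; a quick sketch using the formula and assumptions above (S = 24", I = 24"):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ( $S, $I ) = ( 24, 24 );    # stripe length and gap, in inches

# Per-trip distance from the formula above: T*(S*2) + I*(2*(T-1))
sub trip {
    my $t = shift;
    return $t * ( $S * 2 ) + $I * ( 2 * ( $t - 1 ) );
}

printf qq{Trip %d = %d"\n}, $_, trip($_) for 1 .. 3;
# Trip 1 = 48"
# Trip 2 = 144"
# Trip 3 = 240"
```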

        adding in the width of the streets crossed between blocks, Schlemihl walked an amazing 11,772.18 MILES to paint 3933 stripes on 3 miles of road. At a speed of no less than 1471.5 MPH! (assuming an 8 hour shift)

        Assuming he could carry a bucket that held enough paint to paint 3 miles of stripes, he could paint that 3 miles in about 6 seconds.

        On the other hand, at a more human speed of 2.5 MPH he would only be able to paint about 162 stripes in 8 hours (using the old method), or about 12% of a block, and his performance drops even faster than in the example given. He ends up taking 2.5 years to paint stripes on a 3-mile road.

        Thanks for the diversion! :-)
      I agree, move the writeToFileNamed outside the loop. Instead of adding all the files to one zip and writing it once, you are doing something like the following:
      • Zip up file 1
      • Zip up file 1 and 2
      • Zip up file 1, 2, and 3
      • Zip up file 1, 2, 3, and 4
      You should add all the files, then writeToFileNamed at the end.
Re: Archive::Zip performance question
by NiJo (Friar) on Feb 16, 2005 at 20:17 UTC
    Jonathan,

    Besides the rewriting issue there are some other optimizations possible.

    Rule 1 of optimization: Don't do it. If this is a nightly backup job, who cares about run time?

    Something seems to be wrong with your file organization. Your single directory holds at least 3,500 files, and you still have to pick the text files out of it. What kind of mess is this?

    Don't reinvent the wheel. Assuming a clean directory structure: system("$commandline_packer $options $source_dir $zipfile"). Or even leave the file globbing to the command interpreter: system("$commandline_packer $options *.txt $zipfile"). Estimated speed increase: a factor of 3 to 10. 7-Zip is the GPL successor to WinZip on my system. This is how it was done for ages.
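A concrete sketch of that shell-out, untested; it assumes Info-Zip's command-line `zip` is on the PATH (substitute `7z a` or whatever WinZip-compatible packer you have), and the sample directory at the top is demo scaffolding:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw( make_path );
use IPC::Cmd qw( can_run );

# Demo scaffolding: a directory with a few text files.
my $dir = "text files";
make_path($dir);
for my $i ( 1 .. 3 ) {
    open my $out, '>', "$dir/file$i.txt" or die "open: $!";
    print $out "sample $i\n";
    close $out;
}

# Quoted glob pattern because of the space in the directory name.
my @txt = glob qq{"$dir"/*.txt};
die "No .txt files found in $dir\n" unless @txt;

if ( can_run('zip') ) {
    # List form of system() avoids shell-quoting problems.
    system( 'zip', '-q', "$dir.zip", @txt ) == 0
        or die "zip failed: $?";
}
else {
    warn "no 'zip' on PATH; skipping the actual packing\n";
}
```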

    Assuming many short text files, most of the time will be spent on disk seeks, during which the CPU is idle. With two threads, the previous file can be compressed while the next one is being sought and read.
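A rough sketch of that overlap, assuming a Perl built with ithreads; the glob pattern and the in-memory slurping are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new;

# Reader thread: slurp each file while the main thread compresses
# whatever has already been queued.
my $reader = threads->create( sub {
    for my $file ( glob "*.txt" ) {
        open my $fh, '<', $file or next;
        local $/;                      # slurp mode
        $q->enqueue( [ $file, scalar <$fh> ] );
    }
    $q->enqueue(undef);                # end-of-work marker
} );

while ( defined( my $item = $q->dequeue ) ) {
    my ( $name, $data ) = @$item;
    # compress $data here (e.g. via Compress::Zlib) and add it to the archive
}
$reader->join;
```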

    At numbers as large as yours, it might even pay off to help the OS reduce disk seeks. The idea is to build your file list, stat() it (which has to happen anyway, implicitly) and sort by inode. Only then start reading. Assuming a more or less continuous mapping between inode and physical disk location, you have reduced the number of disk seeks to a minimum. Sometimes you can even hear and see this optimization. I'm not sure whether stat() returns inode numbers on Windows; it did not on my outdated ActiveState Perl version.
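A sketch of the inode sort via a Schwartzian transform; stat()[1] is the inode field (on Windows it may be 0, which makes the sort a harmless no-op):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @files = glob "*.txt";

# stat each file once, sort by inode (field 1 of stat), then read in
# that order to approximate the physical on-disk layout.
my @by_inode =
    map  { $_->[1] }
    sort { $a->[0] <=> $b->[0] }
    map  { [ ( stat $_ )[1], $_ ] } @files;

for my $file (@by_inode) {
    # open and add $file to the archive here, in seek-friendly order
}
```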

    Archive::Zip is much more useful when the files have been munged by Perl before compression. TMTOWTDI.

Re: Archive::Zip performance question
by Miguel (Friar) on Feb 16, 2005 at 22:56 UTC
    It took 86 seconds to add 5,000 .txt files (only 6 KB each) to a zip file:
    #!/usr/bin/perl -w
    use strict;
    use Archive::Zip;
    use IO::All;

    my $io  = io(".");
    my $zip = Archive::Zip->new();
    my $t1  = time();

    while ( $_ = $io->next ) {
        $zip->addFile("$_") if $_ =~ /\.txt$/;
    }

    $zip->writeToFileNamed('ZipFile.zip');
    print "\n", time() - $t1, "\n";

    I'm using Linux on a P4 1.7 GHz with 256 MB RAM.

Re: Archive::Zip performance question
by Courage (Parson) on Feb 16, 2005 at 18:16 UTC
    Archive::Zip is pure Perl.
    I mean it is pure Perl when navigating the ZIP file; it feeds the stream to Compress::Zlib, which is C.

    IMHO Compress::Zlib is too OO and spends much of its time in pure Perl.

    From my own experience (I have used zip from both Perl and Tcl/Tk), Tcl/Tk has vfs::zip, which is faster simply because it is entirely pure C.

    I had the benefit of Archive::Zip being pure Perl, so I was able to run it on my PocketPC device without painful recompiling; and I had the benefit of vfs::zip being faster.

    Now I use a Tcl/Tk un-zipper from my Perl programs.
    Honestly, I can show a working example if needed.