rinceWind has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem, in that I am trying to generate a zip archive of whole file systems' worth of data (multiple gigabytes, lots of files). If I use zip or tar from the command line, this is fine. But I don't want to pick every file; ideally I want a perl module to do the archiving and compression, writing out to disk as it goes.

I have tried Archive::Zip, and this worked for a small test set. But when it comes to full production size, the process gobbles up more and more memory, as the module is building the archive in memory, until I get failures to add files - the program exits without leaving behind a valid zip archive. Several hours wasted :(.
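
A minimal sketch of the Archive::Zip approach (illustrative paths, not my actual code) - every addFile() just registers another member, and nothing reaches disk until writeToFileNamed() at the very end, which is where the memory goes:

    use File::Find;
    use Archive::Zip qw(:ERROR_CODES);

    my $zip = Archive::Zip->new();

    # every member added here is tracked in memory
    find({ no_chdir => 1, wanted => sub {
        return unless -f $File::Find::name;
        $zip->addFile($File::Find::name, $File::Find::name);
    } }, '/data/to/archive');

    # the archive is only written out here, in one go at the end
    $zip->writeToFileNamed('/backup/archive.zip') == AZ_OK
        or die "could not write archive";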

As an alternative, I could pipe filenames to a command line utility, but I was hoping to do this all in perl.

I know that Archive::TarGzip is capable of reading individual files from an archive without slurping the entire archive into memory. What I am looking for is the equivalent for writing an archive. If there is no archive module that can work on disk rather than all in memory, I might have a go at writing one.

--
I'm Not Just Another Perl Hacker

Replies are listed 'Best First'.
Re: Industrial strength archiving
by Eyck (Priest) on Sep 27, 2004 at 15:34 UTC
    Unfortunately, 'Industrial strength archiving' means non-Perl (well, it could be Perl if someone were to sit down and write the necessary libs and utils...).

    Why? Because heavyweight archiving means bypassing (at least partly) the filesystem structure - i.e. not "open /, list all dirs, open every dir, list all files, archive them... etc...", but: linearly walk the bytestream and archive whatever is marked for archiving.

    At this point in time only tools from the dump family can do tricks like that. I use mostly xfs, and xfsdump fits the bill; it beats all perl-lib-accessible methods like Archive::Tar by so wide a margin that they're not really in the same competition (and on a heavily loaded system this means: xfsdump finishes the dump, while tools like tar produce only unusable garbage).

    Now, for compression... lzop is a great tool; I find it extremely useful (it offers very fast compression with compression ratios hovering around what gzip -1 achieves). And it can compress streams, so you shouldn't have any trouble piping to it and from it from perl.
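
    For instance, a minimal sketch of piping a tar stream through lzop from Perl (assuming GNU tar and lzop are on the PATH; paths are illustrative):

        # read an uncompressed tar stream from tar and push it through lzop
        open my $tar,  '-|', 'tar', '-cf', '-', '/data'        or die "tar: $!";
        open my $lzop, '|-', 'lzop -c > /backup/data.tar.lzo'  or die "lzop: $!";

        binmode $tar;
        binmode $lzop;

        my $buf;
        while (read($tar, $buf, 64 * 1024)) {
            print {$lzop} $buf;
        }
        close $tar  or die "tar reported an error";
        close $lzop or die "lzop reported an error";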

    Another toy that I found very neat and useful in the archiving business is rzip. This is definitely not an industrial-strength type of solution, because it's very young (and for your use, it cannot and will not support compressing streams...), but it easily outperforms bzip2 -9 by a healthy 10-30%.

    This is a very significant achievement. What is surprising is that, while working on typical backup archives (multi-gigabyte files), it sometimes runs several times faster than bzip2, while still outperforming it on the compression-ratio front.

    Of course you need a rather healthy machine to run it, because its working set hovers around 0.5 GB...

Re: Industrial strength archiving
by meredith (Friar) on Sep 27, 2004 at 15:56 UTC

    In addition to the other answers, you might see if bacula can do what you want. You can set it up to use file-based storage, with one job per file. If you don't want to bother with the director, catalog, and such running for a restore, you can use the bextract standalone tool.

    mhoward - at - hattmoward.org
Re: Industrial strength archiving
by zentara (Cardinal) on Sep 27, 2004 at 17:30 UTC
    It's not Perl, but it's industrial strength and totally free: dar.

    You could use it via system, and it has the ability to call perl scripts. It would be great if someone with C knowledge could write a perl frontend to libdar.
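
    For example, a rough sketch of driving it via system (the option names here are assumptions from memory - check dar(1) before relying on them):

        # create /backup/full.1.dar from /data with per-file compression
        # (-c = create, -R = root dir, -z = compress; verify against dar(1))
        my @cmd = ('dar', '-c', '/backup/full', '-R', '/data', '-z');
        system(@cmd) == 0
            or die "dar exited with status " . ($? >> 8);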


    I'm not really a human, but I play one on earth. flash japh
Re: Industrial strength archiving
by graff (Chancellor) on Sep 28, 2004 at 02:46 UTC
    The command line tar (or zip or other) utility can build an archive file of any "strength" based on a simple list of files to include. You could just write a simple perl script to walk the directory tree and print the names of files that meet your exacting specifications. Then run a tar (or zip or other) command with that list as input.

    Use whatever you want in the perl script to walk the file structure, but I'll mention (for the second or third time) that File::Find and related modules are considerably slower than reading the output of the command-line "find" utility.
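
    A minimal sketch of that pipeline, assuming GNU tar (so that -T - reads the file list from stdin) and the command-line find (paths and the filter are illustrative):

        # stream names from find, apply whatever selection rules you need,
        # and hand the survivors to tar on its stdin
        open my $find, '-|', 'find', '/data', '-type', 'f'
            or die "find: $!";
        open my $tar, '|-', 'tar', '-cf', '/backup/archive.tar', '-T', '-'
            or die "tar: $!";

        while (my $path = <$find>) {
            chomp $path;
            next if $path =~ /\.tmp$/;    # example filter only
            print {$tar} "$path\n";
        }
        close $find or die "find reported an error";
        close $tar  or die "tar reported an error";

    (File names containing newlines would need find -print0 and tar --null, but the plain newline-separated form covers the common case.)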

Re: Industrial strength archiving
by xorl (Deacon) on Sep 27, 2004 at 14:56 UTC
    If there is no archive module that can work on disk rather than all in memory, I might have a go at writing one.
    Go for it! Off hand I can't think of anything that can do what you want. Sorry.