rinceWind has asked for the wisdom of the Perl Monks concerning the following question:

I have a problem, in that I am trying to generate a zip archive of whole file systems' worth of data (multiple gigabytes, lots of files). If I use zip or tar from the command line, this is fine. But I don't want to pick every file; ideally I want a perl module to do the archiving and compression, writing out to disk as it goes.

I have tried Archive::Zip, and this worked for a small test set. But when it comes to full production size, the process gobbles up more and more memory, as the module is building the archive in memory, until I get failures to add files - the program exits without leaving behind a valid zip archive. Several hours wasted :(.
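
A minimal sketch of the Archive::Zip approach (illustrative paths, not my actual code) - every addFile() just registers another member, and nothing reaches disk until writeToFileNamed() at the very end, which is where the memory goes:

    use File::Find;
    use Archive::Zip qw(:ERROR_CODES);

    my $zip = Archive::Zip->new();

    # every member added here is tracked in memory
    find({ no_chdir => 1, wanted => sub {
        return unless -f $File::Find::name;
        $zip->addFile($File::Find::name, $File::Find::name);
    } }, '/data/to/archive');

    # the archive is only written out here, in one go at the end
    $zip->writeToFileNamed('/backup/archive.zip') == AZ_OK
        or die "could not write archive";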

As an alternative, I could pipe filenames to a command line utility, but I was hoping to do this all in perl.

I know that Archive::TarGzip is capable of reading individual files from an archive without slurping the entire archive into memory. What I am looking for is the equivalent for writing an archive. If there is no archive module that can work on disk rather than all in memory, I might have a go at writing one.

--
I'm Not Just Another Perl Hacker

Replies are listed 'Best First'.
Re: Industrial strength archiving
by Eyck (Priest) on Sep 27, 2004 at 15:34 UTC
    Unfortunately, 'Industrial strength archiving' means non-Perl (well, it could be Perl if someone were to sit down and write the necessary libs and utils...).

    Why? Because heavyweight archiving means bypassing (at least partly) the filesystem structure - i.e. not "open /, list all dirs, open every dir, list all files, archive them... etc...", but: linearly walk the bytestream and archive whatever is marked for archiving.

    At this point in time only tools from the dump family can do tricks like that. I use mostly xfs, and xfsdump fits the bill; it beats all perl-lib-accessible methods like Archive::Tar by so wide a margin that they're not really in the same competition (and on a heavily loaded system this means: xfsdump finishes the dump, while tools like tar produce only unusable garbage).

    Now, for compression... lzop is a great tool; I find it extremely useful (it offers very fast compression with compression ratios hovering around what gzip -1 achieves). And it can compress streams, so you shouldn't have any trouble piping to it and from it from perl.
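
    For instance, a minimal sketch of piping a tar stream through lzop from Perl (assuming GNU tar and lzop are on the PATH; paths are illustrative):

        # read an uncompressed tar stream from tar and push it through lzop
        open my $tar,  '-|', 'tar', '-cf', '-', '/data'        or die "tar: $!";
        open my $lzop, '|-', 'lzop -c > /backup/data.tar.lzo'  or die "lzop: $!";

        binmode $tar;
        binmode $lzop;

        my $buf;
        while (read($tar, $buf, 64 * 1024)) {
            print {$lzop} $buf;
        }
        close $tar  or die "tar reported an error";
        close $lzop or die "lzop reported an error";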

    Another toy that I found very neat and useful in the archiving business is rzip. This is definitely not an industrial-strength type of solution, because it's very young (and for your use, it cannot and will not support compressing streams...), but it easily outperforms bzip2 -9 by a healthy 10-30%.

    This is a very significant achievement. What is surprising is that, while working on typical backup archives (multi-gigabyte files), it sometimes runs several times faster than bzip2, while still outperforming it on the compression-ratio front.

    Of course you need a rather healthy machine to run it, because its working set hovers around 0.5 GB...

Re: Industrial strength archiving
by meredith (Friar) on Sep 27, 2004 at 15:56 UTC

    In addition to the other answers, you might see if bacula can do what you want. You can set it up to use file-based storage, with one job per file. If you don't want to bother with the director, catalog, and such running for a restore, you can use the bextract standalone tool.

    mhoward - at - hattmoward.org
Re: Industrial strength archiving
by zentara (Cardinal) on Sep 27, 2004 at 17:30 UTC
    It's not Perl, but it's industrial strength and totally free: dar.

    You could use it via system, and it has the ability to call perl scripts. It would be great if someone with C knowledge could write a perl frontend to libdar.
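
    For example, a rough sketch of driving it via system (the option names here are assumptions from memory - check dar(1) before relying on them):

        # create /backup/full.1.dar from /data with per-file compression
        # (-c = create, -R = root dir, -z = compress; verify against dar(1))
        my @cmd = ('dar', '-c', '/backup/full', '-R', '/data', '-z');
        system(@cmd) == 0
            or die "dar exited with status " . ($? >> 8);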


    I'm not really a human, but I play one on earth. flash japh
Re: Industrial strength archiving
by graff (Chancellor) on Sep 28, 2004 at 02:46 UTC
    The command line tar (or zip or other) utility can build an archive file of any "strength" based on a simple list of files to include. You could just write a simple perl script to walk the directory tree and print the names of files that meet your exacting specifications. Then run a tar (or zip or other) command with that list as input.

    Use whatever you want in the perl script to walk the file structure, but I'll mention (for the second or third time) that File::Find and related modules are considerably slower than reading the output of the command-line "find" utility.
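
    A minimal sketch of that pipeline, assuming GNU tar (so that -T - reads the file list from stdin) and the command-line find (paths and the filter are illustrative):

        # stream names from find, apply whatever selection rules you need,
        # and hand the survivors to tar on its stdin
        open my $find, '-|', 'find', '/data', '-type', 'f'
            or die "find: $!";
        open my $tar, '|-', 'tar', '-cf', '/backup/archive.tar', '-T', '-'
            or die "tar: $!";

        while (my $path = <$find>) {
            chomp $path;
            next if $path =~ /\.tmp$/;    # example filter only
            print {$tar} "$path\n";
        }
        close $find or die "find reported an error";
        close $tar  or die "tar reported an error";

    (File names containing newlines would need find -print0 and tar --null, but the plain newline-separated form covers the common case.)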

Re: Industrial strength archiving
by xorl (Deacon) on Sep 27, 2004 at 14:56 UTC
    If there is no archive module that can work on disk rather than all in memory, I might have a go at writing one.
    Go for it! Off hand I can't think of anything that can do what you want. Sorry.