Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using LWP to download large files. To avoid storing the entire file in memory I'm reading in smaller chunks (:read_size_hint => 4096) and using callbacks (:content_cb => \&mysub). What I want to do is remove a small header that will only be present in the first chunk. So as I download that first chunk, I'd like to uncompress it, perform a substitution on the text (removing the small header), recompress the chunk, and write it out, as well as the subsequent chunks. This way I don't have to uncompress the whole file after download and recompress it again.

I've tried to do this with IO::Uncompress::Gunzip and IO::Compress::Gzip (since Compress::Zlib recommends using them). See the test case I've included, which takes a gzip file and tries to emulate this behaviour by splitting it into two chunks, modifying the first chunk and gluing them back together again.

This doesn't work, as the second chunk is considered trailing junk. Is it possible to do what I'm trying to do?

use IO::Uncompress::Gunzip;
use IO::Compress::Gzip;

my $gzfile = shift;
open my $fh, $gzfile or die "$gzfile: $!\n";
my $buf = do { local $/; <$fh> };

my ($p1, $p2) = unpack("a4096a*", $buf);

my $ugz = IO::Uncompress::Gunzip->new(\$p1);
$ugz->read(my $gbuf);
$ugz->close;

$gbuf =~ s/.*?(?=^[^%\n])//ms;

my $cgz = IO::Compress::Gzip->new(\ my $z, -Level => 9);
$cgz->syswrite($gbuf);
$cgz->close;

syswrite STDOUT, $z;
syswrite STDOUT, $p2;

Replies are listed 'Best First'.
Re: Can I modify a single chunk of a gzip stream?
by Corion (Patriarch) on Aug 12, 2007 at 20:50 UTC

    gzip's DEFLATE algorithm (which is worth reading up on) compresses by replacing recurring strings with pointers back to an earlier occurrence, within a 32 KB sliding window. So it is unlikely that you will be able to cut anything out of the stream without having to re-encode the whole remainder of the stream.

    Another problem is the Huffman coding of the literals: the code tables will also change when you cut out stuff from the start of the file.
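    Since the remainder has to be re-encoded anyway, one way to stay within the memory limits is to inflate and deflate chunk by chunk. Here is a minimal sketch of that idea, assuming Compress::Raw::Zlib is available. The helper name make_header_stripper is made up for illustration, and the header regex is the one from the question's test case. Plain text is held back only until the header has been removed; after that every chunk is passed straight through.

```perl
use strict;
use warnings;
use Compress::Raw::Zlib;

# Hypothetical helper (not part of any module): returns ($feed, $finish).
# $feed->($chunk) takes one raw gzip chunk, as a content callback would
# receive it; $finish->() ends the new gzip stream written to $out_fh.
sub make_header_stripper {
    my ($out_fh) = @_;

    my ($inf, $ist) = Compress::Raw::Zlib::Inflate->new(
        -WindowBits => WANT_GZIP, -AppendOutput => 1);
    die "cannot create inflater: $ist\n" unless $inf;

    my ($def, $dst) = Compress::Raw::Zlib::Deflate->new(
        -WindowBits => WANT_GZIP, -Level => Z_BEST_COMPRESSION,
        -AppendOutput => 1);
    die "cannot create deflater: $dst\n" unless $def;

    my $pending     = '';   # plain text held back until the header is gone
    my $header_done = 0;

    # Compress some plain text and write whatever output zlib has ready.
    my $emit = sub {
        my ($plain) = @_;
        my $packed = '';
        $def->deflate($plain, $packed) == Z_OK or die "deflate failed\n";
        print {$out_fh} $packed;
    };

    my $feed = sub {
        my ($chunk) = @_;
        return unless length $chunk;
        my $plain = '';
        my $st = $inf->inflate($chunk, $plain);
        die "inflate failed: $st\n" unless $st == Z_OK || $st == Z_STREAM_END;
        if (!$header_done) {
            $pending .= $plain;
            # Header regex copied from the question's test case; keep
            # buffering until the first non-header line has shown up.
            return unless $pending =~ s/.*?(?=^[^%\n])//ms;
            $header_done = 1;
            ($plain, $pending) = ($pending, '');
        }
        $emit->($plain) if length $plain;
    };

    my $finish = sub {
        # If no non-header line was ever seen, pass the text on unchanged.
        $emit->($pending) if length $pending;
        my $packed = '';
        $def->flush($packed) == Z_OK or die "flush failed\n";
        print {$out_fh} $packed;
    };

    return ($feed, $finish);
}
```

    With LWP this could be wired up roughly as $ua->get($url, ':read_size_hint' => 4096, ':content_cb' => sub { $feed->($_[0]) }) followed by $finish->() once the request returns. Memory use stays bounded by the chunk size plus zlib's own buffers, but the whole stream still gets recompressed, as explained above.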

Re: Can I modify a single chunk of a gzip stream?
by graff (Chancellor) on Aug 12, 2007 at 20:46 UTC
    I'm not sure this will help, but if you have a situation where you are dealing with input and output file handles, and the corresponding "files" are both supposed to be gzip compressed, you probably want to try PerlIO::gzip -- it sets up (un)compression as a PerlIO layer:
    use PerlIO::gzip;
    open my $fh, "<:gzip", $gzfile or die "$gzfile: $!\n";
    binmode STDOUT, ":gzip";
    while (<$fh>) {
        # do stuff
        print;
    }
    (worked for me on a simple task)

    UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via::gzip. (see PerlIO::gzip or PerlIO::via::gzip).

Re: Can I modify a single chunk of a gzip stream?
by Raster Burn (Beadle) on Aug 13, 2007 at 15:35 UTC

    I think this would be easier if you just did this with the shell and pipes:

    curl ... | gunzip -c | sed -e script | gzip -c

    You can also s/sed/perl/ :)
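    For the Perl variant, a small filter along these lines could take the place of sed in the pipeline. This is only a sketch: the script name strip_header.pl and the sub name are made up, and it assumes, as the regex in the question suggests, that header lines are exactly the leading lines that are empty or start with %.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filter: copy $in to $out, dropping the leading run of
# header lines (empty lines or lines starting with '%').
sub strip_header {
    my ($in, $out) = @_;
    my $in_header = 1;
    while (my $line = <$in>) {
        if ($in_header) {
            next if $line =~ /^(?:%|$)/;   # still inside the header
            $in_header = 0;                # first real line: stop skipping
        }
        print {$out} $line;
    }
}

# Run as a filter when used as a script with piped input.
strip_header(\*STDIN, \*STDOUT) unless caller() || -t STDIN;
```

    Usage would then be something like: curl ... | gunzip -c | perl strip_header.pl | gzip -c. Reading line by line means only one line is held in memory at a time.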