Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I need to read a large number (2300) of fairly large gzipped text files (2M on average), do something with the content of each and save the result (500K-1.5M) again as gzips (keeping the originals just in case they are later needed).

There are quite a few modules to work with gzipped data, and there is also the possibility of using command line tools via backticks. Is there any reason to prefer a module over the command line? Any specific module that is better than others? There aren't really enough ratings to make an informed choice...

I've been using bzip2 until now, but both the software that creates the original files and Compress::Bzip2 were producing a lot of broken files, so I'd like to find a better solution.

Opinions would be appreciated :-)

Re: Recommendation to zip/unzip gzip files
by cavac (Prior) on Jul 03, 2012 at 11:13 UTC

    The disadvantages of using backticks are plenty. In your case, you would be starting 4600 external processes (one to decompress and one to compress each of the 2300 files), which will take quite a lot of time and resources.

    Using something like Compress::Zlib or IO::Compress::Gzip also lets you handle compressed text files much like plaintext files, without using temporary (uncompressed) files.
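
    A minimal sketch of that module-based approach, assuming line-oriented data. The sub names `recompress` and `process_line` are hypothetical placeholders, not something from the original question; only the IO::Uncompress::Gunzip and IO::Compress::Gzip calls are real API:

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);
use IO::Compress::Gzip     qw($GzipError);

# Hypothetical transformation; replace with the real per-line processing.
sub process_line {
    my ($line) = @_;
    return uc $line;
}

# Read $in_file (gzip), transform each line, write $out_file (gzip).
# The original file is only read, never modified.
sub recompress {
    my ($in_file, $out_file) = @_;

    my $in = IO::Uncompress::Gunzip->new($in_file)
        or die "gunzip failed for $in_file: $GunzipError\n";
    my $out = IO::Compress::Gzip->new($out_file)
        or die "gzip failed for $out_file: $GzipError\n";

    # Stream line by line, so no uncompressed temporary file is needed.
    while (defined(my $line = $in->getline)) {
        $out->print(process_line($line));
    }

    $in->close;
    $out->close or die "close failed for $out_file: $GzipError\n";
    return;
}

# Usage: perl recompress.pl input.gz output.gz
recompress(@ARGV) if @ARGV == 2;
```

    Both objects behave like ordinary filehandles, so the loop body would look the same for plain text files. Whether this is actually faster than piping through gzip for 2300 files is something to measure rather than assume.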

    Sorry for any bad spelling, broken formatting and missing code examples. During a slight disagreement with my bicycle (which i lost), i broke my left forearm near the elbow. I'm doing the best i can here...
Re: Recommendation to zip/unzip gzip files
by zentara (Cardinal) on Jul 03, 2012 at 13:40 UTC
Re: Recommendation to zip/unzip gzip files
by Anonymous Monk on Jul 03, 2012 at 10:46 UTC

    There aren't really enough ratings to make an informed choice... Opinions would be appreciated :-)

    Pick one released in 2011 or 2012 and try it out. It shouldn't take more than half an hour to find a module with an interface you like that works for your files.

    IO::Uncompress::Bunzip2 looks ok to me

    IPC::System::Simple capturex also looks useful

    See also "How does one choose among modules?"

      I knew there must be something on choosing modules somewhere in the monastery. Couldn't find it easily though. Thanks.
Re: Recommendation to zip/unzip gzip files
by mrguy123 (Hermit) on Jul 03, 2012 at 13:21 UTC
    Hi there,
    I don't have much experience in gzipping files, but recently I needed to parse a gzipped file of about 15 GB (~60 GB after decompression).
    I used this fairly simple code to open it:
    ## Open file - special case for gzip
    if ($input_file =~ /\.gz$/) {
        open(IN, "gunzip -c $input_file |") || die "can't open pipe to $input_file";
    }
    I then parsed it like a regular file and it took me about a minute (amazingly fast)
    Hope this helps
    Mister Guy



    About half of the world's greatest inventions were invented by single men trying to impress women. The other half were invented by married men looking for an excuse to get out of the house
Re: Recommendation to zip/unzip gzip files
by Anonymous Monk on Jul 03, 2012 at 11:02 UTC

    If your data is line-oriented (or otherwise processed in chunks), I'd just open a pipe to an external process:

    # read
    open my $in, '-|', 'gzip', '-dc', $filename;
    while (<$in>) {
        print $_;
    }
    close $in;

    # write
    open my $out, '|-', 'gzip', '-c', $filename;
    print $out "blah";
    close $out;

      Whoops, that writing doesn't quite work (it prints the compressed data to STDOUT), and I'm not sure how to fix that without invoking the shell.

      With the shell, it goes something like this:

      open my $out, '| gzip -c > ' . $filename;
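
      One way to get the same effect without the shell might be a forking open, where the child redirects its own STDOUT to the target file before exec'ing gzip. This is only a sketch: the sub name `open_gzip_writer` is made up for illustration, and it assumes a `gzip` binary is on the PATH.

```perl
use strict;
use warnings;

# Return a write handle whose output ends up gzip-compressed in $filename.
# The file name never passes through a shell, so spaces and
# metacharacters in it need no quoting.
sub open_gzip_writer {
    my ($filename) = @_;

    # Forking open: the parent gets a write handle, the child gets the
    # reading end of the pipe as its STDIN.
    my $pid = open(my $out, '|-');
    die "cannot fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: point STDOUT at the target file, then become gzip.
        open(STDOUT, '>', $filename) or die "cannot write $filename: $!";
        exec('gzip', '-c') or die "cannot exec gzip: $!";
    }

    return $out;    # parent only
}

# Usage: perl gzwrite.pl output.gz
if (my $filename = shift @ARGV) {
    my $out = open_gzip_writer($filename);
    print {$out} "blah\n";
    # close waits for gzip and reports a non-zero exit status as failure.
    close($out) or die "gzip failed, status $?";
}
```

      Checking the return value of close matters here, because that is where a failed gzip child shows up.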