kettle has asked for the wisdom of the Perl Monks concerning the following question:

Is there a simple way to read compressed files? I would like to read in a .gz file, manipulate the compressed text, and output a new compressed .gz file. I can pipe STDOUT through gzip, but would prefer not to use zcat to pipe things to STDIN. I found a couple of CPAN modules that support this to some extent, but they seem to still be in development.

Replies are listed 'Best First'.
Re: reading compressed data
by graff (Chancellor) on Dec 13, 2006 at 02:24 UTC
    If you have (or can install) PerlIO::gzip, you can read and write compressed files using an IO layer, like this:
    use PerlIO::gzip;

    open( INPUT,  "<:gzip", "old.gz" ) or die "old.gz: $!";
    open( OUTPUT, ">:gzip", "new.gz" ) or die "new.gz: $!";

    while (<INPUT>) {
        # do something with a line of text...
        s/[\r\n]+/\n/;    # for example, normalize line terminations
        print OUTPUT;
    }
    If for some reason you have constraints that get in the way of installing non-core modules, but you have "gzip" and "gunzip" on your system (and in your PATH), you can just use pipeline opens:
    open( INPUT,  "gunzip < old.gz |" ) or die $!;
    open( OUTPUT, "| gzip > new.gz" )   or die $!;

    while (<INPUT>) {
        # same as above...
    }
    There are other methods as well, involving other modules (try looking at the search results for gzip at CPAN).
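    For instance, the IO::Compress family (core as of Perl 5.9.3, on CPAN otherwise) can do the same job through an object interface. An untested sketch along the lines of the PerlIO::gzip version above; the first two lines just manufacture a small old.gz so the sketch runs standalone:

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# demo setup: write a small old.gz so this sketch runs standalone
gzip \"foo\r\nbar\r\n" => 'old.gz' or die "setup: $GzipError";

my $in  = IO::Uncompress::Gunzip->new('old.gz')
    or die "old.gz: $GunzipError";
my $out = IO::Compress::Gzip->new('new.gz')
    or die "new.gz: $GzipError";

while ( defined( my $line = $in->getline ) ) {
    $line =~ s/[\r\n]+/\n/;    # normalize line terminations, as above
    $out->print($line);
}

$in->close;
$out->close;
```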

    UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via::gzip. (see PerlIO::gzip or PerlIO::via::gzip).

        I haven't done / don't recall seeing any benchmarks comparing PerlIO::gzip against the pipeline open, and I wouldn't hazard a guess that one of them is likely to be significantly faster than the other.

        If it's just a one-shot pass over 5.3 GB, don't sweat it and use whichever one strikes you as more fun. But if this will be an ongoing, oft-repeated process working on lots of data, it might be worth your while to set up a simple test to see if there might be a speed difference.

        In that case, I'd advise against test scripts that only do the i/o. Contrast two versions of the script such that both do everything that needs to be done, and they differ only in the i/o method. If one is faster than the other, you'll get a clear idea of how important the difference is in the context of everything else the script does.
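        A skeleton for that kind of head-to-head, using the core Benchmark module (untested; 'sample.gz', the iteration count, and process_line() are placeholders for your real file and per-line work). I've shown the pipeline open against core IO::Uncompress::Gunzip here; if you have PerlIO::gzip installed, drop its "<:gzip" open in the same way:

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use IO::Uncompress::Gunzip qw($GunzipError);

my $file = 'sample.gz';    # placeholder: point this at a representative file

unless ( -e $file ) {      # demo setup so the skeleton runs standalone
    require IO::Compress::Gzip;
    IO::Compress::Gzip::gzip( \( "some line of text\n" x 1000 ) => $file )
        or die "setup failed";
}

sub process_line { }       # stand-in for the script's real per-line work

timethese( 10, {
    pipe_open => sub {
        open my $fh, '-|', "gunzip < $file" or die $!;
        process_line($_) while <$fh>;
        close $fh;
    },
    io_uncompress => sub {
        my $z = IO::Uncompress::Gunzip->new($file)
            or die "$file: $GunzipError";
        process_line($_) while defined( $_ = $z->getline );
        $z->close;
    },
} );
```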


      Awesome, thanks a lot!
Re: reading compressed data
by jasonk (Parson) on Dec 13, 2006 at 02:29 UTC
      I found 4 of the ones you mention after a two-minute search, but when there are so many choices and I'm not familiar with any of them, it seems faster to ask than to read through all of the associated documentation and try to figure out whether there is one I ought to use, or others that I perhaps shouldn't. For example, the first module I googled upon, IO::Uncompress::RawInflate, warns "WARNING -- This is a Beta release. Do NOT use in production code." Anyway, thanks for the list.
Re: reading compressed data
by Util (Priest) on Dec 13, 2006 at 02:42 UTC
    Use the "piped" form of open(). Untested code:
    # Old-style: Bareword filehandles and two-arg opens:
    open IN,  "zcat $in_filename|"         or die;
    open OUT, "|gzip -c - > $out_filename" or die;

    # New style: Lexical filehandles and three-arg opens:
    open my $in_fh,  '-|', "zcat $in_filename"          or die;
    open my $out_fh, '|-', "gzip -c - > $out_filename"  or die;
    See also "Using open() for IPC" in perlipc.