dnquark has asked for the wisdom of the Perl Monks concerning the following question:

Update: Resolved; the original problem seems to have been caused by a corrupt *gz file...
------------

Problem synopsis: reading in a gzipped ASCII file via IO::Uncompress::Gunzip stalls after about 2,000,000 lines, with the memory usage growing.

Details: I am reading lines of ASCII data stored in large (~1 GB) gzipped files. I first tried opening them via open $fh, "gunzip -c $fn |", but the script quickly runs out of memory and dies.
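For reference, the pipe version with error checking added looks roughly like this (just a sketch; $fn holds the file name, and the list form of open keeps the name away from the shell):

    open( my $gz, '-|', 'gunzip', '-c', $fn )
        or die "Cannot start gunzip on $fn: $!";
    while ( my $line = <$gz> ) {
        chomp $line;
        # process one line at a time; nothing is kept around in memory
    }
    close $gz or warn "gunzip exited abnormally for $fn: $?";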
I decided to try IO::Uncompress::Gunzip. But then simply reading the file line by line appears to fail:
    my $z = new IO::Uncompress::Gunzip $ARGV[0];
    while( <$z> ){ chomp; print "$.:$_\n"; }

This stalls after about 2,200,000 lines. The script's memory usage grows, but otherwise it appears to just sit there, stuck... Can anyone shed some light on what is going on, and hopefully suggest another way to read large gz files directly from Perl?..
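For reference, a fuller version of that read loop with explicit error checking might look something like this (just a sketch, using a manual line counter):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    # Open the gzipped file named on the command line.
    my $z = IO::Uncompress::Gunzip->new( $ARGV[0] )
        or die "gunzip failed on $ARGV[0]: $GunzipError\n";

    my $count = 0;
    while ( defined( my $line = $z->getline ) ) {
        chomp $line;
        printf "%d:%s\n", ++$count, $line;
    }
    # If the stream ended abnormally (e.g. a truncated or corrupt file),
    # $GunzipError should describe the problem here.
    warn "problem reading $ARGV[0]: $GunzipError\n" if $GunzipError;
    $z->close;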

Replies are listed 'Best First'.
Re: Problems reading large gz files with IO::Uncompress::Gunzip
by NetWallah (Canon) on Dec 30, 2008 at 02:12 UTC
Re: Problems reading large gz files with IO::Uncompress::Gunzip
by tilly (Archbishop) on Dec 30, 2008 at 03:23 UTC
    If the script runs out of memory with the original pipeline approach then you are leaking memory. Are you doing something like keeping a copy of all of the data you have seen in a data structure? Perl won't have enough memory available to do that.
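    As an illustration only (the script name and the summary it computes are made up), a streaming pass that keeps a running summary instead of the lines themselves stays in roughly constant memory:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical illustration -- run as: gunzip -c file.gz | perl summarize.pl
        my $lines = 0;
        my $bytes = 0;
        while ( my $line = <STDIN> ) {
            $lines++;
            $bytes += length $line;
            # push @saved, $line;   # <-- keeping every line like this is what eats memory
        }
        print "$lines lines, $bytes bytes\n";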
      The original question might be resolved: it appears that the *gz file I was using for testing is somehow corrupt (gunzip fails on it, reporting "unexpected end of file").

      I still wonder whether there are any benefits to using IO::Uncompress::Gunzip as opposed to a pipe. Generally, are there any caveats I should be aware of when dealing with large *gz files in Perl?.. Feel free to contribute bits of wisdom, but the original question is as of now moot.
        Sorry - I have not had the need to explore IO::Uncompress::Gunzip, so I cannot offer direct experience or advice.
        It just seemed like a worthwhile option to explore when you were up against limited memory and large files.
        Hopefully our brethren and sistren here have more profound experience to offer.

             ..to maintain is to slowly feel your soul, sanity and sentience ebb away as you become one with the Evil.

        I think the biggest benefit is when you pair it with IO::Uncompress::AnyUncompress: at that point you can treat compressed and uncompressed files the same. A single open statement will open either, transparently.
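        A minimal sketch of that, assuming the file name arrives as the first argument (the Transparent option, on by default, is what lets uncompressed input pass straight through):

            use strict;
            use warnings;
            use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);

            # Reads $ARGV[0] whether it is plain text or gzip/bzip2/zip compressed.
            my $in = IO::Uncompress::AnyUncompress->new( $ARGV[0], Transparent => 1 )
                or die "Cannot open $ARGV[0]: $AnyUncompressError\n";
            while ( defined( my $line = <$in> ) ) {
                print $line;
            }
            $in->close;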