Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a large text file (> 3 GB) that I'd like to search with a Perl script. Can I zip the file and then search the compressed version, without having to fully unzip it to perform the search? I'm willing to accept some speed degradation, of course.

Replies are listed 'Best First'.
Re: Searching within compressed files
by sk (Curate) on Apr 15, 2005 at 16:37 UTC
     open(IN,"zcat $myfile|") or die "Cannot open file: $myfile\n"; # This pipes the contents of the file to IN file handle

    Would something like this work? I am assuming you are on a system where zcat works on .Z and .gz files. For other compression formats, use the corresponding utility and pipe its output to Perl.

    cheers

    SK Update: Here is a snippet. Sorry I couldn't get the file reading and matching in one line :(

    #!/usr/local/bin/perl -w
    my $myfile = "test.gz";
    open(IN, "zcat $myfile|") or die "Cannot open file: $myfile\n";
    while (<IN>) {
        print($_) if /mymatch/;
    }

      Since I'm a big fan of both lexical filehandles and the safer list form of piped open, here's a version that uses both:

      use strict;
      use warnings;

      my $myfile = shift;    # take the filename as a command-line argument
      open my $infh, "-|", zcat => $myfile
          or die "Cannot open file: $myfile\n";
      while (<$infh>) {
          print if /mymatch/;
      }

      This avoids using a global filehandle, and would allow you to take the filename in as a command line argument, without worrying about escaping nasty characters. I think it's a pretty big benefit for a fairly small change in the code.

      That works! I was wrestling with Compress::Zlib, but I kept getting "insufficient memory" errors, even on a much smaller test file.

      Thanks!

Re: Searching within compressed files
by gam3 (Curate) on Apr 15, 2005 at 17:20 UTC
    If you are using gzopen and gzread($buffer), you should not be running out of memory, as gzread returns data in 4096-byte blocks.
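    To flesh that out, here is a minimal sketch of streaming a gzipped file through Compress::Zlib's gzopen/gzreadline interface, which reads the file incrementally rather than decompressing it all into memory. The `search_gz` helper name is mine, not from the thread, and this assumes Compress::Zlib is installed:

    ```perl
    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Compress::Zlib;    # exports gzopen() and $gzerrno

    # Stream a gzipped file line by line and return the lines
    # matching $pattern. Only one line is held in memory at a time.
    sub search_gz {
        my ($file, $pattern) = @_;
        my $gz = gzopen($file, "rb")
            or die "Cannot open $file: $gzerrno\n";
        my @hits;
        my $line;
        # gzreadline returns the number of bytes read, 0 at EOF
        while ($gz->gzreadline($line) > 0) {
            push @hits, $line if $line =~ /$pattern/;
        }
        $gz->gzclose;
        return @hits;
    }
    ```

    gzread($buffer) works the same way but hands you raw 4096-byte blocks, so for line-oriented matching gzreadline is the more convenient call.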
    -- gam3
    A picture is worth a thousand words, but takes 200K.