bwilli27 has asked for the wisdom of the Perl Monks concerning the following question:

It was hard to come up with a concise title, but here's the problem I am trying to solve: given a directory of zipped log files, I need to perform the following for each file. Each zip file contains a single structured server log with \r\n trailing each log entry.

* Decompress the file into something in memory (currently a scalar)

* Treat the scalar as a file opened for input, reading it line by line in order to do something with each line

I am using IO::Uncompress::Unzip to decompress the zip files. Here's a snippet of my code.

use IO::Uncompress::Unzip qw(unzip $UnzipError);

my $logDirectory = '/logs';

foreach my $zipFile (glob("$logDirectory/*.zip")) {
    my $output;
    print "Decompressing $zipFile to memory\n";
    unzip $zipFile => \$output;
    #print $output;

    open my $fh, '<', \$output or die $!;
    while (<$fh>) {
        #do something
    }
    close $fh or die $!;
}

The code within the while (<$fh>) read loop is never reached. If I print the content of $output after unzipping the file, the correct log data is displayed. I have alternatively tried various methods of splitting $output on '\r\n', but I believe I'm hitting Perl size limitations; the unzipped data is roughly 2.1GB and Perl bombs with either a 'Split loop' error or some type of panic.

I am using perl 5.14 on a 64-bit Linux machine with 64GB of memory. The problem may seem odd, but I am trying to optimize the processing of thousands of compressed server log files. My 'old' Perl script writes the decompressed log file to disk, reads that file for processing, and moves on to the next zip. Ideally I want to keep the decompressed content in memory, process the data there, and only write to disk the log entries that match my search criteria.
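
For reference, the kind of filtering I have in mind looks roughly like this, building on the in-memory filehandle $fh from the snippet above (the /ERROR/ pattern and the output path are just placeholders for my real search criteria):

# Rough sketch of the intended filtering step; the pattern and
# output path are placeholders for the real search criteria.
open my $matches, '>>', "$logDirectory/matches.log" or die $!;
while (my $entry = <$fh>) {
    print {$matches} $entry if $entry =~ /ERROR/;
}
close $matches or die $!;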

Re: Unzip file to scalar, treat scalar as a file
by pmqs (Friar) on May 15, 2014 at 22:10 UTC

    You can skip the expense of holding the complete uncompressed zip file in memory and stream directly from the zip files as shown below.

    use IO::Uncompress::Unzip qw(unzip $UnzipError);

    my $logDirectory = '/logs';

    foreach my $zipFile (glob("$logDirectory/*.zip")) {
        print "Processing $zipFile\n";

        local $/ = "\r\n";

        my $unzip = IO::Uncompress::Unzip->new($zipFile)
            or die "Cannot open $zipFile: $UnzipError\n";

        while (<$unzip>) {
            #do something
        }
        close $unzip or die $!;
    }

    Doing it like this means that the complete uncompressed zip file does not need to be stored in memory.

    Also note that I set $/ to "\r\n", as you say the data uses this to delimit each log entry.

Re: Unzip file to scalar, treat scalar as a file
by Laurent_R (Canon) on May 15, 2014 at 21:39 UTC
    I am a bit surprised that 2.1 GB of unzipped data should exhaust the resources of a 64-bit machine with 64 GB memory. My overall gut-feeling is that, even with the overhead involved, it should fit.

    Having said that, one solution might be to ask the operating system to do the unzipping and redirect the output to your Perl program, which then reads its input line by line (a rough sketch follows below). I do this quite regularly in a somewhat similar (though different) situation, with input files of at least the same order of magnitude (and sometimes significantly more): a ksh or bash script (my main program) performs some initial operations on the input file (for example, sorting the data) and pipes the output to my Perl program, which reads line by line and does all the further transformations needed. I have found this to be a pretty efficient method, both in terms of memory usage and data volume throughput.

    In some cases, if I remember accurately, I have even piped together an unzipping operation, a Unix sort, and a Perl program under ksh, and I don't remember exceeding the platform's quotas or limits (well, I sometimes did, but that had to do with wrong ulimit parameters and similar configuration warts; with the proper system configuration it worked in my experience). So I would think this method could apply equally well to your case, although the conditions may obviously differ; one obvious requirement is that the decompressing utility be able to send its output to STDOUT rather than to a physical file (not sure which ones can or can't do that).
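
    As a rough sketch of what I mean (assuming the Info-ZIP unzip utility, whose -p option writes the extracted data to STDOUT, and using hypothetical file names), the shell side and the Perl side could look something like this:

    # driver.sh (hypothetical): let the OS do the unzipping and pipe into Perl
    #   for z in /logs/*.zip; do unzip -p "$z" | perl process_log.pl; done

    # process_log.pl (hypothetical): read the decompressed stream from STDIN
    use strict;
    use warnings;

    $/ = "\r\n";                  # the log entries are \r\n-terminated
    while (my $entry = <STDIN>) {
        chomp $entry;
        # do something with $entry, e.g. print matching lines to a results file
    }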

    Hope this helps.

Re: Unzip file to scalar, treat scalar as a file
by Anonymous Monk on May 15, 2014 at 20:16 UTC

    Use the OO Interface, for example:

    my $z = IO::Uncompress::Unzip->new($input, AutoClose => 1)
        or die "IO::Uncompress::Unzip failed: $UnzipError\n";

    while (<$z>) {
        # ...
    }
Re: Unzip file to scalar, treat scalar as a file
by jmacloue (Beadle) on May 16, 2014 at 15:14 UTC

    I'd use the funzip tool from the Info-ZIP package, with just something like:

    open my $log, "-|", "funzip", $zipFile;

    while (<$log>) {
        ...
    }

    This tool simply sends its output to stdout, and from there you can read it line by line as you want.
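
    Since your entries are terminated by \r\n, a slightly fuller sketch (assuming funzip is on the PATH; it extracts the first member of the archive, which in your case is the only one) might be:

    local $/ = "\r\n";                  # match the \r\n-terminated log entries
    open my $log, '-|', 'funzip', $zipFile
        or die "Cannot run funzip on $zipFile: $!";
    while (my $entry = <$log>) {
        chomp $entry;
        # do something with $entry
    }
    close $log or warn "funzip reported an error for $zipFile\n";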