http://qs1969.pair.com?node_id=875268

rizzy has asked for the wisdom of the Perl Monks concerning the following question:

I have about 3500 tar.gz files which each contain several thousand text files which I want to parse to look for keywords. Right now, I'm unzipping and decompressing each of the archives, parsing each text file, and then deleting everything and moving on to the next file. Is it possible to parse (using simple regular expressions) one of these archives without unzipping/decompressing? A colleague mentioned that you can use pipes to do this with certain types of archives. Much of the time in running my code is spent unzipping, etc. I couldn't find any info using the super search or google, so maybe it isn't possible.

Replies are listed 'Best First'.
Re: Parse a tar.gz file without unzipping and uncompressing unzipping?
by ww (Archbishop) on Dec 03, 2010 at 20:32 UTC
    I have several thousand bottles of adult beverages I'd like to drink without removing the tops.

    Google hasn't told me how to do that, either.

    Using pipes or other devices may spare you (in the best cases) a bit of time writing the unzipped file to disk, but it won't do diddly about reducing unzip and untar times.

      You may not be able to drink the beverages without removing the tops, but you can surely look inside and describe the contents. Google it.

        It's a bit difficult to tell a solution of HCl from plain water or thinner or Everclear et cetera without opening the bottle. And that's assuming the bottle isn't opaque or colored. Google me. I mean, uh, what was I saying?

Re: Parse a tar.gz file without unzipping and uncompressing unzipping?
by Illuminatus (Curate) on Dec 03, 2010 at 21:09 UTC
      Thanks. That's what I'm currently doing. It's manageable right now, but I figured I'd check to see if it could be further improved.
        From your post, it seemed like you were unpacking into a filesystem. Do the majority of the files you parse change each time you run the program? If not, you could use something like KinoSearch to index them, so that you can quickly search the files that have not changed.

        fnord

Re: Parse a tar.gz file without unzipping and uncompressing unzipping?
by talexb (Chancellor) on Dec 03, 2010 at 22:06 UTC

    I was told about zgrep recently and was gobsmacked that I'd never heard of it before. Don't know if that will do the trick, but give it a shot.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      By using zcat you can set up a pipeline to uncompress and extract the files from the archive as a stream without using temporary files or needing to hold everything in RAM.

      On Linux the code looks like this:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $file = shift;
      my $pipecmd = "zcat $file | tar -O -xf -"; # -O, extract files to standard output
      open(my $PIPEIN, '-|', $pipecmd) or die "Opening pipe [$pipecmd]: $!\n";
      while ( my $line = <$PIPEIN> ) {
          chomp $line;
          print "$line\n"; # do parsing here
      }

      A Windows version of gzip is available from The gzip home page. I'm not sure if the Win version includes zcat.

      You'll need a Windows tar as well. I'm not sure if the pipe command will work in Windows.

      This may need better handling of file not found and other errors from the pipe command, but it should get you started.
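As a rough sketch of what that error handling could look like: the version below checks that the archive is readable before starting the pipe, and checks the result of close(), which is where a failed child process shows up. The helper name scan_archive and its callback argument are made up for illustration, and the quoting assumes the file name contains no single quote. Note that the shell reports only the exit status of the last command in the pipeline (tar here).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pass every line of every member of a .tar.gz
# to a callback, dying on failures instead of ignoring them.
sub scan_archive {
    my ($file, $on_line) = @_;

    # Catch a missing or unreadable archive before starting any child process.
    -r $file or die "Cannot read archive [$file]: $!\n";

    # Same pipeline as in the post; the name is single-quoted for the shell
    # (assumption: the name itself contains no single quote).
    my $pipecmd = "zcat '$file' | tar -O -xf -";
    open(my $pipe, '-|', $pipecmd)
        or die "Opening pipe [$pipecmd]: $!\n";

    while (my $line = <$pipe>) {
        chomp $line;
        $on_line->($line);
    }

    # close() on a pipe waits for the child and returns false if it failed.
    close($pipe)
        or die $! ? "Closing pipe [$pipecmd]: $!\n"
                  : "Pipe [$pipecmd] exited with status " . ($? >> 8) . "\n";
    return;
}

# Usage: print every line that contains a keyword.
if (@ARGV) {
    my $file = shift;
    scan_archive($file, sub { print "$_[0]\n" if $_[0] =~ /keyword/ });
}
```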

        my $pipecmd = "zcat $file | tar  -O -xf -"; # -O, extract files to standard output

        Your tar may even have decompression built in. Look for the tar options "-Z" (compress/uncompress) and "-z" (gzip/gunzip); with "-z" the separate zcat step is not needed.
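A minimal sketch of that variant, assuming a tar that supports -z (GNU tar and bsdtar both do) and a Unix-like system. The helper name tar_stream_command is hypothetical. Because tar does the decompression itself, there is no shell pipeline left, so the list form of open can be used and the file name never passes through a shell.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the tar argument list.
# -x extract, -z gunzip, -O write member contents to standard output,
# -f read the named archive.
sub tar_stream_command {
    my ($file) = @_;
    return ('tar', '-x', '-z', '-O', '-f', $file);
}

if (@ARGV) {
    my $file = shift;
    my @cmd  = tar_stream_command($file);

    # List-form pipe open: no shell is involved, so spaces or other odd
    # characters in $file reach tar untouched.
    open(my $pipe, '-|', @cmd) or die "Cannot run [@cmd]: $!\n";
    while (my $line = <$pipe>) {
        print $line if $line =~ /keyword/;    # do parsing here
    }
    close($pipe) or die "[@cmd] failed with status " . ($? >> 8) . "\n";
}
```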

      Thanks, Alex. That may have been what my friend was referring to. He said he wasn't sure if what he had in mind would work with tarred files.
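One caveat with tarred files: both zgrep and the "tar -O" pipeline see the contents of all members as one stream, so a match cannot easily be attributed to a particular text file. If that matters, a possible alternative (not something proposed in this thread) is the core Archive::Tar module, whose iter method hands back one member at a time without writing anything to disk. The sketch below assumes IO::Zlib is installed for gzip support; members_matching is a made-up name, and this approach may well be slower than the external tar.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Archive::Tar;

# Hypothetical helper: return the paths of archive members whose content
# matches $regex, reading one member at a time from the compressed archive.
sub members_matching {
    my ($file, $regex) = @_;

    # Second argument true: the archive is compressed (needs IO::Zlib).
    my $next = Archive::Tar->iter($file, 1)
        or die "Cannot read [$file]: " . Archive::Tar->error . "\n";

    my @hits;
    while (my $member = $next->()) {
        next unless $member->is_file;          # skip directories, links, etc.
        my $content = $member->get_content;
        push @hits, $member->full_path
            if defined $content && $content =~ $regex;
    }
    return @hits;
}

# Usage: list the members that mention a keyword.
if (@ARGV) {
    my $file = shift;
    print "$_\n" for members_matching($file, qr/keyword/);
}
```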

      I'm not sure how this squares with some of the other responses in this thread, though.

Re: Parse a tar.gz file without unzipping and uncompressing unzipping?
by leed25d (Sexton) on Dec 04, 2010 at 16:39 UTC
    Emacs can read and explode compressed tar files (*.tar.gz) when auto-compression-mode is enabled. If you are willing to read some lisp you should be able to locate the .el files without much difficulty.