drblove27 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am trying to open very large .gz files, but I want to avoid taking a memory hit. Specifically I am running this code:

#!/usr/bin/perl -w
use strict;
use Archive::Tar;

my $start = time;
my @here  = Archive::Tar->list_archive("filename.tar.gz");
foreach (@here) {
    print "$_\n";
}
my $time_taken = time - $start;
print "Took $time_taken seconds to process\n";

My problem is that the tar.gz file is 64MB, but the contents are roughly 4GB.

I guess I should say what I really want. Each of these tar.gz files has just a single file inside that I need to extract, and that file can be very large. I don't really care about the name inside the tar.gz file, I just want to extract it. Since the file is so large, I would like to extract it while minimizing the memory hit, so I can then read the extracted contents in, line by line, to do the processing that I need to do.

I really want to avoid trying to stick the whole file in memory. Is there a smarter way to do this? Or, at the very least, a way to minimize the memory hit that I am going to take?

Here is my "best" attempt at this... (this is on a Windows system, so *nix tricks won't save me...)

#!/usr/bin/perl -w
use strict;
use Archive::Tar;

my $filename  = "filename.tar.gz";
my $tar       = Archive::Tar->new($filename);
my @filenames = $tar->list_files;
$tar->extract_file($filenames[0], 'temp.out');

open(my $fh, '<', 'temp.out') or die "Cannot open temp.out: $!";
while (<$fh>) {
    # ... line-by-line processing ...
}
close($fh);
Thanks in advance!

Re: File list from gz file without reading everything into memory
by Fletch (Bishop) on Nov 18, 2009 at 20:13 UTC

    You're probably not going to have much luck, because tar.gz archives don't have a separate table of contents; the member files are just streamed one after the other, each with a header prepended (just as if it were being written to, erm, tape; go figure . . .). On top of that, the entire archive is compressed, so to get at a file in the middle you've got to read and uncompress everything up to the file you're interested in.

    The easiest thing (if you can wrangle it and want to avoid unnecessary overhead) would be to get whoever's providing you the file to switch to another format (e.g. zip) which has a separate index and from which individual parts can easily be extracted piecemeal.
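    If you can get zip files instead, here's a minimal Archive::Zip sketch (the filenames are stand-ins):

    use strict;
    use warnings;
    use Archive::Zip qw(:ERROR_CODES);

    my $zip = Archive::Zip->new();
    $zip->read('filename.zip') == AZ_OK or die "read error";

    # The central directory is a real index, so listing is cheap . . .
    my @names = $zip->memberNames();

    # . . . and a single member can be pulled out without touching the rest.
    $zip->extractMember($names[0], 'temp.out') == AZ_OK or die "extract error";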

    The cake is a lie.

      Dang... I will go back to the source and see if there is something better to do than that...

      If a .tar.gz file is just a compression of a single file, is there a memory-efficient way to extract it, almost like streaming it into the output file? Or is that in essence what the code is doing already?

      Thanks again for your reply, I will check back with the generator of these files to see if I can try another compression approach...

Re: File list from gz file without reading everything into memory
by jethro (Monsignor) on Nov 18, 2009 at 21:43 UTC

      From the Archive::Tar documentation on the Archive::Tar->iter class method:

      Returns an iterator function that reads the tar file without loading it all in memory. Each time the function is called it will return the next file in the tarball. The files are returned as Archive::Tar::File objects. The iterator function returns the empty list once it has exhausted the files contained.

      From the FAQ:

      Isn't Archive::Tar heavier on memory than /bin/tar?

      ... If you just want to extract, use the extract_archive class method instead. It will optimize and write to disk immediately. ...

      Maybe this answers your question.
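      For illustration, a minimal sketch of the iterator (the filename is a stand-in; the 1 flags the archive as compressed):

      use strict;
      use warnings;
      use Archive::Tar;

      # Each call to $next->() returns the next member as an
      # Archive::Tar::File object, without slurping the whole archive.
      my $next = Archive::Tar->iter("filename.tar.gz", 1);
      while (my $f = $next->()) {
          print $f->name, "\n";
      }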

        So here is the code that I tried:
        use strict;
        use Archive::Tar;

        my $filename = "filename.tar.gz";
        Archive::Tar->extract_archive($filename);
        As near as I can tell, it just tried to load the whole thing into memory so it could then write it out, but it ran out of memory before getting there. I think what this comment means is that if your tar file includes multiple files, it will not read them all into memory at once. Instead it will load a single file into memory, write it out, load the next file, write it out, and so on.

        What I am looking to do is something like: read a binary chunk, write it to the file, drop that chunk from memory, and repeat until done. As far as I can tell this module does not do that... If anyone knows better, I would really appreciate it.

        Otherwise I am checking out really thin un-tar programs, like the ones mentioned in the comments to this post, that I can package with my program and call via a system call to handle the untar...

        If anyone has any other ideas, I would love to hear them. I am stuck with this file compression since it is the output of the machine.

Re: File list from gz file without reading everything into memory
by ikegami (Patriarch) on Nov 18, 2009 at 20:43 UTC
    open(my $fh, '-|', tar => 'tzf', $archive) or die;
    chomp( my @files = <$fh> );

    It'll have to uncompress the whole file, but it shouldn't keep any of it in memory.
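    The same pipe trick should also handle the extraction itself: GNU tar's O flag writes the extracted member to stdout, so its contents can be read line by line without ever landing in memory whole (a sketch, assuming GNU tar):

    # x = extract, z = gunzip, O = to stdout, f = from this file
    open(my $fh, '-|', tar => 'xzOf', $archive) or die $!;
    while (my $line = <$fh>) {
        # ... process $line ...
    }
    close($fh);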

    Update: Doh! I just noticed you're on Windows. By that, I presume you mean you don't have the tar tool.

      Yeah, no tar tool... though I am sure I can find something with a command line for Windows that I could call out to... When I use the code that I provided, my system runs out of memory, i.e. I cannot un-tar.gz this file using the Archive::Tar tool...

        It might not be an option for you, but it looks like tar is included with Cygwin. Might be worth a try.

        Cheers,

        JohnGG

Re: File list from gz file without reading everything into memory
by salva (Canon) on Nov 19, 2009 at 09:44 UTC
    The tar file format is quite simple: a 512-byte header followed by the file contents, padded to a 512-byte boundary (see Tar_(file_format)).

    As you only have one file per archive...

    1. open the file with the corresponding :gzip layer
    2. read the header and extract the file size from it
    3. read lines, taking care to strip the extra NULs from the last line of the file (see the sketch after this list)
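    A minimal sketch of that recipe, assuming PerlIO::gzip from CPAN for the :gzip layer (the filename is a stand-in):

    use strict;
    use warnings;
    use PerlIO::gzip;    # supplies the :gzip layer

    my $archive = "filename.tar.gz";
    open my $fh, '<:gzip', $archive or die "open $archive: $!";

    # Step 2: the first 512 bytes are the tar header; the member's
    # size sits at offset 124 as a 12-byte octal ASCII field.
    read($fh, my $header, 512) == 512 or die "short tar header";
    my $size = oct substr($header, 124, 12);

    # Step 3: read lines, but stop after $size bytes so the NUL
    # padding of the final 512-byte block never reaches the caller.
    my $read = 0;
    while ($read < $size and defined(my $line = <$fh>)) {
        substr($line, $size - $read) = '' if $read + length($line) > $size;
        $read += length $line;
        # ... process $line here ...
    }
    close $fh;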
Re: File list from gz file without reading everything into memory
by stefbv (Priest) on Nov 19, 2009 at 09:05 UTC
    Each of these tar.gz files has just a single file

    To simplify a little more: in this case tar is not needed at all; gzip or bzip2 can be used alone to compress the single file.
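    If that change happens, the core IO::Uncompress::Gunzip module can then read the plain .gz line by line with no temp file at all (a sketch; the filename is a stand-in):

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    # Lines are decompressed on demand, so the 4GB payload
    # never has to fit in memory at once.
    my $z = IO::Uncompress::Gunzip->new("filename.gz")
        or die "gunzip failed: $GunzipError";

    while (defined(my $line = $z->getline)) {
        # ... process $line here ...
    }
    $z->close;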