in reply to Re^3: Read multiple text file from bz2 without extract first
in thread Read multiple text file from bz2 without extract first

So let's say I have test.bz2 which contains test.txt, and test.txt is 1 GB in size. Does extracting test.txt to disk and then processing it take the same amount of time as reading test.txt directly, without extracting it first?

Re^5: Read multiple text file from bz2 without extract first
by Corion (Patriarch) on Mar 27, 2012 at 08:42 UTC

    Why don't you compare the times yourself? The answer depends on what is faster: decompressing and reading in a single pass (CPU-bound), or decompressing, writing the result to disk, and reading it back (IO-bound). It also depends on whether you need to process the file more than once.

    From Perl, you can directly decompress and read by using the pipe-open:

    open my $fh, "bzip -cd $file |" or die "Couldn't open '$file': $!";

    That is efficient if you only need to read the data once. If you need to read it more than once and have the disk space needed, decompressing once and then reading the decompressed file is likely faster.
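
    For comparison, here is a minimal sketch of the decompress-once alternative, assuming the bzip2 binary is on your PATH (the file name is a placeholder):

    # Decompress to disk once, keeping the original archive (-k),
    # then open the plain-text file for repeated reads.
    my $file = 'test.txt.bz2';    # placeholder path
    system('bzip2', '-dk', $file) == 0
        or die "bzip2 failed on '$file': $?";
    (my $plain = $file) =~ s/\.bz2$//;
    open my $fh, '<', $plain or die "Couldn't open '$plain': $!";

    After that, every subsequent pass over the data only pays the cost of reading the already-decompressed file.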

      Corion, when I run your code, it reports "bzip is not an internal command....". What am I missing? What I actually want is to read a txt file inside a bz2 without extracting it first, match its content against some keywords, and write the results to an array or a text file.
        I'm never sure whether the program name is bzip or bzip2. Use whatever the name of the program is. Also, you will need to have that program available in the PATH.
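        If you'd rather not depend on an external program at all, the IO::Uncompress::Bunzip2 module that ships with modern Perls can read the bz2 stream directly. Here is a rough sketch of the keyword matching you describe (the file name and pattern are placeholders):

        use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

        my $file    = 'test.bz2';        # placeholder path
        my $keyword = qr/some_keyword/;  # placeholder pattern

        # Open the compressed file; lines are decompressed on the fly.
        my $z = IO::Uncompress::Bunzip2->new($file)
            or die "Can't open '$file': $Bunzip2Error";

        my @matches;
        while (defined(my $line = $z->getline())) {
            push @matches, $line if $line =~ $keyword;
        }
        $z->close();

        print @matches;    # or write them to an output file instead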
Re^5: Read multiple text file from bz2 without extract first
by mbethke (Hermit) on Mar 27, 2012 at 22:10 UTC
    I would think that decompressing while reading would be faster, but as Corion said, it depends. bzip2 usually compresses text files very well, so the IO load is much lower if you don't write the decompressed text back to disk. If, however, you need to read the file several times or seek around in it, it may be worth writing it to disk. A gigabyte of text on a modern machine has a good chance of staying largely in the file system cache, so reading it again runs mostly at RAM speed.