in reply to Re^3: Read multiple text file from bz2 without extract first
in thread Read multiple text file from bz2 without extract first

So let's say I have test.bz2 which contains test.txt, and test.txt is 1 GB in size. Does extracting test.txt to disk and then processing it take the same amount of time as reading test.txt directly, without extracting it first?

Re^5: Read multiple text file from bz2 without extract first
by Corion (Patriarch) on Mar 27, 2012 at 08:42 UTC

    Why don't you compare the times yourself? The answer depends on what is faster: decompressing and reading in a single pass (CPU-bound), or decompressing, writing the result to disk, and reading it back (IO-bound). It also depends on whether you need to process the file more than once.

    From Perl, you can directly decompress and read by using the pipe-open:

    open my $fh, "bzip -cd $file |" or die "Couldn't open '$file': $!";

    That is efficient if you only need to read the data once. If you need to read it more than once and have the disk space needed, decompressing once and then reading the decompressed file is likely faster.
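
    For comparison, here is a minimal sketch of the decompress-once alternative, assuming the bzip2 binary is on your PATH (the file name is a placeholder):

    # Decompress to disk once, keeping the original archive (-k),
    # then open the plain-text file for repeated reads.
    my $file = 'test.txt.bz2';    # placeholder path
    system('bzip2', '-dk', $file) == 0
        or die "bzip2 failed on '$file': $?";
    (my $plain = $file) =~ s/\.bz2$//;
    open my $fh, '<', $plain or die "Couldn't open '$plain': $!";

    After that, every subsequent pass over the data only pays the cost of reading the already-decompressed file.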

      Corion, when I run your code, it reports "bzip is not an internal command....". What am I missing? What I actually want is to read a txt file inside a bz2 without extracting it first, match its content against some keywords, and write the results to an array or a text file.
        I'm never sure whether the program name is bzip or bzip2. Use whatever the name of the program is. Also, you will need to have that program available in the PATH.
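        If you'd rather not depend on an external program at all, the IO::Uncompress::Bunzip2 module that ships with modern Perls can read the bz2 stream directly. Here is a rough sketch of the keyword matching you describe (the file name and pattern are placeholders):

        use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

        my $file    = 'test.bz2';        # placeholder path
        my $keyword = qr/some_keyword/;  # placeholder pattern

        # Open the compressed file; lines are decompressed on the fly.
        my $z = IO::Uncompress::Bunzip2->new($file)
            or die "Can't open '$file': $Bunzip2Error";

        my @matches;
        while (defined(my $line = $z->getline())) {
            push @matches, $line if $line =~ $keyword;
        }
        $z->close();

        print @matches;    # or write them to an output file instead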
Re^5: Read multiple text file from bz2 without extract first
by mbethke (Hermit) on Mar 27, 2012 at 22:10 UTC
    I would think that decompressing while reading would be faster, but as Corion said, it depends. bzip2 usually compresses text files very well, so the IO load is much lower if you don't write the decompressed text back to disk. If, however, you need to read the file several times or seek around in it, it may be worth writing it to disk. A gigabyte of text on a modern machine has a good chance of staying largely in the file system cache, so reading it again runs mostly at RAM speed.