Re: Searching a gzip file
by Corion (Patriarch) on Aug 26, 2010 at 08:04 UTC
How would your approach work with a normal file?
If you show us the code you already have, we can likely give you much better advice on moving from a normal file to a compressed file. Have you considered simply decompressing your compressed file first?
Personally, I'm fond of just using the pipe-open like this:
my $file = 'some_file.gz';
open my $fh, "gzip -cd $file |"
    or die "Couldn't read/decompress '$file': $!";
while (<$fh>) {
    ...
}
This approach also lends itself well to using the flip-flop operator `..`; see perlop.
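For instance, the flip-flop operator makes it easy to collect only the lines between a start and an end marker. A minimal sketch, assuming `<name>` and `</id>` as placeholder delimiters (the `@input` array stands in for the decompressed stream you would read from the pipe-open above):

```perl
use strict;
use warnings;

# Sample lines standing in for the decompressed gzip stream.
my @input = (
    "junk before\n",
    "<name> name 1 </name>\n",
    "some payload\n",
    "<id> 42 </id>\n",
    "junk after\n",
);

# The flip-flop turns true on the <name> line and false again
# after the </id> line, so only the record in between is kept.
my @record;
for my $line (@input) {
    push @record, $line if $line =~ /<name>/ .. $line =~ m{</id>};
}
print @record;
```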
open FILE, '>', 'log_' . $expt_name . '.txt' or die "Can't open file: $!\n";
print FILE "***********************\n";
my @Ids = `zcat $log_files | sed -e 's/\\\n/\\n/g' | grep -B5 'name\?' | grep Id`;
foreach (@Ids) {
    print $_ . "\n";
    # Search for and print the first occurrence of <score> after $_
}
close FILE;
There is not much code I have written so far :), I am still trying to figure out which tools to use. Thanks!
While I'm not sure why you seem to be writing a shell script in Perl instead of using the Perl built-in mechanisms, I think it would help if you posted about 10 relevant lines of your input data. Your current approach should work regardless of whether the file is compressed or not, so I'm not sure I've understood where you see your problem.
Re: Searching a gzip file
by sundialsvc4 (Abbot) on Aug 26, 2010 at 13:32 UTC
I would unzip the file to a temporary location, then process it as a normal file.
This is the sort of thing that you could easily do with the awk tool, or of course with Perl. The general approach that I would use is that of a state-machine.
In this approach, consider that there are exactly four kinds of lines you can be looking at, at any time:
- Pattern 1.
- Pattern 2.
- A line that is neither.
- End of file. (No more lines exist.)
And there are three states:
- The last pattern seen was Pattern 1.
- The last pattern seen was Pattern 2.
- Neither pattern has been seen yet. (Initial state.)
So, the general idea is to imagine a 3x4 rectangular table and to work out, for each square, what the program needs to do.
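A sketch of that state machine in Perl, using hypothetical tag patterns for Pattern 1 and Pattern 2 (`<name>` and `<id>` here, chosen to match the sample data elsewhere in the thread; adapt the regexes to the real log format):

```perl
use strict;
use warnings;

# Sample lines standing in for the decompressed log stream.
my @input = (
    "<name> name 1 </name>\n",
    "noise\n",
    "<id> 101 </id>\n",
    "<name> name 2 </name>\n",
    "<id> 202 </id>\n",
);

my %id_for;             # name => id, filled as records complete
my $state = 'initial';  # 'initial' | 'seen_name'
my $last_name;

for my $line (@input) {
    if ($line =~ /<name>\s*(.*?)\s*<\/name>/) {    # Pattern 1
        $last_name = $1;
        $state     = 'seen_name';
    }
    elsif ($line =~ /<id>\s*(.*?)\s*<\/id>/) {     # Pattern 2
        # Pair the id with the most recent name, but only if we
        # have actually seen a name since the last id.
        $id_for{$last_name} = $1 if $state eq 'seen_name';
        $state = 'initial';
    }
    # Lines that are neither pattern require no action in any state.
}

print "$_ => $id_for{$_}\n" for sort keys %id_for;
```

Because only the current state and the last name are kept in memory, this works line by line on arbitrarily large input, with no need to remember file positions.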
That is a very methodical approach, and I will try it and post my code once it's done. If I understand correctly, this involves unzipping the file and using a file pointer to remember where I last stopped scanning. Is there any other way of doing it without unzipping the file? I am not averse to using awk/sed, but from my limited knowledge I can't seem to come up with a way of achieving this. The point where I am getting stuck is remembering the position of pattern 1. That's why I thought Perl would rescue me with filehandles. But anything with filehandles will impose a limitation on the size of the file I am dealing with. Also, I would like to point out that pattern 1 and pattern 2 are different:
< name > name 1 < /name >
# unknown no. of lines
< id > unique id1 < /id >
# unknown no. of lines
< name > name 2 < /name >
# unknown no. of lines
< id > unique id2 < /id >
# Getting the ids from the first directory
sub doFile {
    my ($fn) = @_;
    chomp($fn);
    print "opening $fn\n";
    my $fh = IO::File->new($fn, 'r');
    my @msgLines;
    if (defined $fh) {
        while (my $l = <$fh>) {
            push @msgLines, $l;
            if ($l =~ m{</msg>\s*$}) {
                if (grep { m{http://.*foo\.com} } @msgLines) {
                    # Store @msgLines; this array can serve as the source
                    # for searching for responses from the first directory.
                    # Something similar is needed for the other directories.
                    my ($id) = map { m{<Id>(\d+)</Id>} ? $1 : () } @msgLines;
                    push @IDs, $id;
                }
                @msgLines = ();
            }
        }
    }
    else {
        die "Cannot open file $!\n";
    }
}

my @firstdir = @{ $logfiles[0] };
my $path     = $logdirs[0];
foreach (@firstdir) {
    my $curpath = "$path/$_";
    print "In foreach trying to open $curpath\n";
    doFile($curpath);
}
The log files are huge, so zipping them into a single file is not possible (out of disk space). Are there any Perl modules that can help me with this task?
Re: Searching a gzip file
by graff (Chancellor) on Aug 27, 2010 at 01:51 UTC
You may want to check out PerlIO::gzip -- it would allow you to write a script like this:
#!/usr/bin/perl
use strict;
use PerlIO::gzip;

open( my $zip, "<:gzip", "bigfile.gz" ) or die "$!\n";
while (<$zip>) {
    print if /some regex/;
}
That example is equivalent to using a shell command like this (having the same benefit of not needing to save an uncompressed version of the big file on disk, even temporarily):
gunzip < bigfile.gz | grep 'some regex'
Note that PerlIO::gzip can open output files too, with open( $fh, ">:gzip", "out.gz"), in case you expect to generate a lot of output and want it to be compressed as you go -- that would be equivalent to adding | gzip > out.gz at the end of the shell command line shown above.
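Put together, a self-contained sketch of that read-and-write pattern (the file names and the `^keep:` filter are placeholders invented for the demo; the script writes its own small gzipped input first so it can run anywhere PerlIO::gzip is installed):

```perl
use strict;
use warnings;
use PerlIO::gzip;

# Write a small gzipped input file for the demo.
open( my $w, ">:gzip", "demo_in.gz" ) or die "write: $!";
print {$w} "keep: alpha\n", "drop: beta\n", "keep: gamma\n";
close $w;

# Filter one gzipped file into another; no uncompressed copy
# of either file ever hits the disk.
open( my $in,  "<:gzip", "demo_in.gz" )  or die "read: $!";
open( my $out, ">:gzip", "demo_out.gz" ) or die "write: $!";
while (<$in>) {
    print {$out} $_ if /^keep:/;
}
close $in;
close $out or die "close: $!";

# Read the compressed result back to show it round-trips.
open( my $check, "<:gzip", "demo_out.gz" ) or die "reread: $!";
my @kept = <$check>;
close $check;
print scalar(@kept), " lines kept\n";
unlink "demo_in.gz", "demo_out.gz";
```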
UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via::gzip (see PerlIO::gzip or PerlIO::via::gzip).
Re: Searching a gzip file
by Khen1950fx (Canon) on Aug 26, 2010 at 11:35 UTC
If you want to search for a pattern, then do what Corion advised---decompress the file. However, you can also count the lines of a gzipped file directly, without writing an uncompressed copy to disk:
#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my $lines    = 0;
my $filename = '/path/to/ImageMagick.tar.gz';
open( my $fh, '<:gzip', $filename )
    or die "Can't open '$filename': $!";

# read() goes through the :gzip layer (sysread would bypass it and
# count newlines in the raw compressed bytes instead).
while ( read $fh, my $buffer, 4096 ) {
    $lines += ( $buffer =~ tr/\n// );
}
print $lines, "\n";