baski has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a large gzip file (>1.2M). The file contains two patterns:

**pattern1**

#unknown number of lines

**pattern2**

#unknown number of lines

**pattern1**

#unknown number of lines

**pattern2**

I want to search for each pattern1 and print the first occurrence of pattern2 that follows it. I have formed an array with all pattern1 matches. Now the task is to print the first occurrence of pattern2 for each pattern1. I can only think of a filehandle approach that stops scanning the file after finding a pattern1 and then does a zgrep -m 1 for the first pattern2 occurrence from where the file pointer stopped scanning. But my files are huge and in gzip format, so this approach won't work. Any ideas on how to go about it?

Appreciate any help :)

Re: Searching a gzip file
by Corion (Patriarch) on Aug 26, 2010 at 08:04 UTC

    How would your approach work with a normal file?

    If you show us the code you have already, then we can likely give you much better help on moving from a normal file to a compressed file. Have you considered just decompressing your compressed file?

    Personally, I'm fond of just using the pipe-open like this:

    my $file = 'some_file.gz';
    open my $fh, "gzip -cd $file |"
        or die "Couldn't read/decompress '$file': $!";
    while (<$fh>) {
        ...
    };

    This approach also lends itself well to using the flip-flop operator .., see perlop.
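
    For instance, a minimal sketch of that combination (assuming /pattern1/ and /pattern2/ stand in for your real regexes, and 'some_file.gz' for your file):

    my $file = 'some_file.gz';
    open my $fh, "gzip -cd $file |"
        or die "Couldn't read/decompress '$file': $!";
    while (<$fh>) {
        if ( my $range = /pattern1/ .. /pattern2/ ) {
            # Inside a pattern1..pattern2 range, the flip-flop returns a
            # sequence number; on the line that ends the range (the first
            # pattern2 after a pattern1) that number has "E0" appended.
            print if $range =~ /E0$/;
        }
    }
    close $fh;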

      open FILE, '>log_' . $expt_name . '.txt'
          or die "Can't open file: $!\n";
      print FILE "***********************\n";
      my @Ids = `zcat $log_files | sed -e 's/\\\n/\\n/g' | grep -B5 'name\?' | grep Id`;
      foreach (@Ids) {
          print $_ . "\n";
          # Search for and print the first occurrence of <score> after $_
      }
      close FILE;

      There is not much code I have written so far :). I am still trying to figure out which tools to use. Thanks!

        While I'm not sure why you seem to be writing a shell script in Perl instead of using the Perl built-in mechanisms, I think it would help if you posted about 10 relevant lines of your input data. Your current approach should work regardless of whether the file is compressed or not, so I'm not sure I've understood where you see your problem.
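
        For what it's worth, a rough pure-Perl sketch of that zcat/sed/grep pipeline (ignoring the sed step, and using simplified placeholder patterns /name/ and /Id/ with the same $log_files variable as your snippet):

        my @window;    # the last few lines seen, like grep -B5
        open my $fh, "zcat $log_files |"
            or die "Can't decompress '$log_files': $!";
        while (<$fh>) {
            push @window, $_;
            shift @window if @window > 5;
            # When a line matches the name pattern, pull the Id lines
            # out of the preceding window.
            print grep { /Id/ } @window if /name/;
        }
        close $fh;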

Re: Searching a gzip file
by sundialsvc4 (Abbot) on Aug 26, 2010 at 13:32 UTC

    I would unzip the file to a temporary location, then process it as a normal file.

    This is the sort of thing that you could easily do with the awk tool, or of course with Perl.   The general approach that I would use is that of a state-machine.

    In this approach, consider that there are exactly four kinds of lines you can be looking at, at any time:

    1. Pattern 1.
    2. Pattern 2.
    3. A line that is neither.
    4. End of file.   (No more lines exist.)

    And there are three states:

    1. The last pattern seen was Pattern 1.
    2. The last pattern seen was Pattern 2.
    3. Neither pattern has been seen yet.  (Initial state.)

    So, the general idea is to imagine a 3x4 rectangular table and to work out, for each square, what the program needs to do.
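
    A minimal Perl sketch of this state machine (assuming /pattern1/ and /pattern2/ stand in for the real regexes, and reading through a decompressing pipe-open as shown earlier in the thread, rather than unzipping to disk):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $state = 'init';    # neither pattern seen yet (initial state)
    my $file  = 'big_file.gz';
    open my $fh, "gzip -cd $file |" or die "Can't decompress '$file': $!";
    while (my $line = <$fh>) {
        if ($line =~ /pattern1/) {
            $state = 'p1';    # last pattern seen was pattern 1
        }
        elsif ($line =~ /pattern2/) {
            # Print only the first pattern 2 after a pattern 1; later
            # pattern 2 lines (state 'p2') are ignored.
            print $line if $state eq 'p1';
            $state = 'p2';    # last pattern seen was pattern 2
        }
        # A line that is neither pattern leaves the state unchanged.
    }
    close $fh;    # end of file: no more lines, nothing left to do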

      That is a very methodical approach; I will try it and post my code once it's done. If I understand it correctly, this involves unzipping the file and using a file pointer to remember where I last stopped scanning. Is there any other way of doing it without unzipping the file? I am not averse to using awk/sed, but from my limited knowledge I can't seem to come up with a way of achieving this. The point where I am getting stuck is remembering the position of pattern 1. That's why I thought Perl would rescue me with filehandles. But anything with filehandles will impose a limitation on the size of the file I am dealing with. Also, I would like to point out that each occurrence of pattern 1 and pattern 2 is different:

      <name> name 1 </name>

      #unknown number of lines

      <id> unique id1 </id>

      #unknown number of lines

      <name> name 2 </name>

      #unknown number of lines

      <id> unique id2 </id>

      Hi Monks, I need suggestions on the fastest way to do this: I have 3 directories with log files in them. All log files have the following pattern:

      <block>

      <id> xyz </id>

      <url> foo.com </url>

      ..

      <response> xyz </response>

      </block>

      <block>

      ..

      The task is to get the id if the url is foo.com from the logs of the first directory, search for that id in all the directories (including the first one), and print the responses from the corresponding blocks into a separate file.
      # Getting the ids from the first directory. (@IDs, @logfiles and
      # @logdirs are set up elsewhere in the script.)
      use IO::File;

      sub doFile {
          my ($fn) = @_;
          chomp($fn);
          print "opening $fn\n";
          my $fh = IO::File->new($fn, 'r');
          my @msgLines;
          if (defined $fh) {
              while (my $l = <$fh>) {
                  push @msgLines, $l;
                  if ($l =~ m{</msg>\s*$}) {
                      if (grep { m{http://.*foo\.com} } @msgLines) {
                          # Store @msgLines; this array can serve as the source for
                          # searching for responses from the first directory. Need to
                          # do something similar for the rest of the directories.
                          my ($id_line) = grep { m{<Id>(\d+)</Id>} } @msgLines;
                          my ($id)      = $id_line =~ m{<Id>(\d+)</Id>};
                          push @IDs, $id;
                      }
                      @msgLines = ();
                  }
              }
          }
          else { die "Cannot open file $!\n"; }
      }

      my @firstdir = @{ $logfiles[0] };
      my $path     = $logdirs[0];
      foreach (@firstdir) {
          my $curpath = $path . '/' . $_;
          print "In foreach trying to open $curpath\n";
          doFile($curpath);
      }
      The log files are huge, so zipping them into a single file is not possible (out of disk space). Any Perl modules that can help me with this task?
Re: Searching a gzip file
by graff (Chancellor) on Aug 27, 2010 at 01:51 UTC
    You may want to check out PerlIO::gzip -- it would allow you to write a script like this:
    #!/usr/bin/perl
    use strict;
    use PerlIO::gzip;

    open( my $zip, "<:gzip", "bigfile.gz" ) or die "$!\n";
    while (<$zip>) {
        print if ( /some regex/ );
    }
    That example is equivalent to using a shell command like this (having the same benefit of not needing to save an uncompressed version of the big file on disk, even temporarily):
    gunzip < bigfile.gz | grep 'some regex'
    Note that PerlIO::gzip can open output files too, with open( $fh, ">:gzip", "out.gz"), in case you expect to generate a lot of output and want it to be compressed as you go -- that would be equivalent to adding | gzip > out.gz at the end of the shell command line shown above.
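
    Putting both directions together, a minimal sketch (with the same placeholder file names and regex as above):

    #!/usr/bin/perl
    use strict;
    use PerlIO::gzip;

    open( my $zip, "<:gzip", "bigfile.gz" ) or die "$!\n";
    open( my $out, ">:gzip", "out.gz" )     or die "$!\n";
    while (<$zip>) {
        # Matching lines are recompressed on the fly as they are written.
        print $out $_ if /some regex/;
    }
    close $out;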

    UPDATE: (2010-10-18) It seems that PerlIO::gzip should be viewed as superseded by PerlIO::via::gzip (see PerlIO::gzip or PerlIO::via::gzip).

Re: Searching a gzip file
by Khen1950fx (Canon) on Aug 26, 2010 at 11:35 UTC
    If you want to search for a pattern, then do what Corion advised: decompress the file. However, you can look at the gzipped file and count the lines.
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $lines = 0;
    my $filename = '/path/to/ImageMagick.tar.gz';
    die "Can't open '${filename}': $!"
        unless open(FILE, '<', $filename);
    while (sysread FILE, my $buffer, 4096) {
        $lines += ($buffer =~ tr/\n//);
    }
    print $lines, "\n";

      But a gzipped file is binary, and counting the newlines in a binary file does not make much sense.
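
      For what it's worth, a minimal sketch that counts lines of the decompressed stream instead, reusing the pipe-open technique from earlier in the thread (the file name is a placeholder):

      my $filename = '/path/to/file.gz';
      open my $fh, "gzip -cd $filename |"
          or die "Can't decompress '$filename': $!";
      my $lines = 0;
      $lines++ while <$fh>;
      close $fh;
      print $lines, "\n";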