in reply to Searching a gzip file

I would unzip the file to a temporary location, then process it as a normal file.
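
For the unzip-to-a-temporary-file route, a minimal sketch (the logfile.gz and /tmp paths below are only placeholders):

use strict;
use warnings;

# Decompress to a temporary file first, then read it like any plain text file.
my $tmp = '/tmp/logfile.txt';                  # placeholder path
system("gunzip -c logfile.gz > $tmp") == 0
    or die "gunzip failed: $?\n";

open my $fh, '<', $tmp or die "Cannot open $tmp: $!\n";
while (my $line = <$fh>) {
    # ... normal line-by-line processing goes here ...
}
close $fh;
unlink $tmp;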

This is the sort of thing you could easily do with awk, or of course with Perl. The general approach I would use is that of a state machine.

In this approach, consider that at any moment there are exactly four kinds of line you could be looking at:

  1. Pattern 1.
  2. Pattern 2.
  3. A line that is neither.
  4. End of file.  (No more lines exist.)

And there are three states:
  1. The last pattern seen was Pattern 1.
  2. The last pattern seen was Pattern 2.
  3. Neither pattern has been seen yet.  (Initial state.)

So, the general idea is to imagine a 3x4 rectangular table and to work out, for each square, what the program needs to do.
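
To make that concrete, here is a rough Perl sketch of one way such a state machine could look, assuming Pattern 1 and Pattern 2 are plain regular expressions and that the goal is to print each Pattern 1 ... Pattern 2 block. The two qr// patterns and that action are placeholders; what each cell of the table actually does depends on the real requirement.

use strict;
use warnings;

my $pat1  = qr/PATTERN1/;        # placeholder for Pattern 1
my $pat2  = qr/PATTERN2/;        # placeholder for Pattern 2
my $state = 'none';              # 'none', 'saw1', or 'saw2'
my @held;                        # lines collected since the last Pattern 1

while (my $line = <>) {
    if ($line =~ $pat1) {
        $state = 'saw1';
        @held  = ($line);        # start a fresh block at Pattern 1
    }
    elsif ($line =~ $pat2) {
        if ($state eq 'saw1') {
            print @held, $line;  # a Pattern 1 ... Pattern 2 block is complete
        }
        $state = 'saw2';
        @held  = ();
    }
    elsif ($state eq 'saw1') {
        push @held, $line;       # a line that is neither, inside a block
    }
    # in state 'none' or 'saw2', a neither-line needs no action
}
# End of file: any half-finished block left in @held is simply dropped

Each branch corresponds to one column of the table; end of file is the fourth column and is handled when the while loop runs out of lines.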

Re^2: Searching a gzip file
by baski (Novice) on Aug 26, 2010 at 15:08 UTC
    That is a very methodical approach; I will try it and post my code once it's done. If I understand it correctly, this involves unzipping the file and using a file pointer to remember where I last stopped scanning. Is there any other way of doing it without unzipping the file? I am not averse to using awk/sed, but from my limited knowledge I can't seem to come up with a way of achieving this. The point where I am getting stuck is remembering the position of pattern 1. That's why I thought Perl would rescue me with file handles, but anything with file handles will impose a limitation on the size of the file I am dealing with. Also, I would like to point out that pattern 1 and pattern 2 are different:

    <name> name 1 </name>

    # unknown no. of lines

    <id> unique id1 </id>

    # unknown no. of lines

    <name> name 2 </name>

    # unknown no. of lines

    <id> unique id2 </id>
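
    A minimal sketch of how this could be done without unzipping the file, using the core IO::Uncompress::Gunzip module to read the compressed stream line by line and a scalar to hold the most recent <name> instead of a file position. The log.gz filename is a placeholder, and the regexes assume the tags look like the sample above:

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw($GunzipError);

    my $z = IO::Uncompress::Gunzip->new('log.gz')
        or die "gunzip failed: $GunzipError\n";

    my $last_name;                           # most recent <name> seen, if any
    while (my $line = <$z>) {
        if ($line =~ m{<\s*name\s*>\s*(.*?)\s*<\s*/name\s*>}) {
            $last_name = $1;                 # remember the name itself, not a file offset
        }
        elsif ($line =~ m{<\s*id\s*>\s*(.*?)\s*<\s*/id\s*>}) {
            print "$last_name => $1\n" if defined $last_name;
            undef $last_name;                # wait for the next <name>
        }
    }
    close $z;

    This reads one line at a time, so the size of the file is not a concern.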

Best way to search through blocks of data
by baski (Novice) on Sep 11, 2010 at 02:57 UTC
    Hi Monks, I need suggestions on the fastest way to do this: I have 3 directories with log files in them. All log files have the following pattern:

    <block>

    <id> xyz </id>

    <url> foo.com </url>

    ..

    <response> xyz </response>

    </block>

    <block>

    ..

    The task is to get the id from the logs of the first directory if the url is foo.com, search for that id in all the directories (including the first one), and print the responses from the corresponding blocks into a separate file.
    # getting the ids from the first directory
    use strict;
    use warnings;
    use IO::File;

    our @IDs;          # ids collected from the first directory
    our @logfiles;     # per-directory lists of file names (array of array refs, filled elsewhere)
    our @logdirs;      # the directory paths themselves (filled elsewhere)

    sub doFile {
        my ($fn) = @_;
        chomp $fn;
        print "opening $fn\n";
        my $fh = IO::File->new($fn, 'r')
            or die "Cannot open file $fn: $!\n";
        my @msgLines;
        while (my $l = <$fh>) {
            push @msgLines, $l;
            next unless $l =~ m{</msg>\s*$};    # end of one block
            if (grep { m{http://.*foo\.com} } @msgLines) {
                # this block mentions foo.com, so pull out its id
                my ($id) = map { m{<Id>(\d+)</Id>} ? $1 : () } @msgLines;
                push @IDs, $id if defined $id;
                # store @msgLines somewhere too; that array can serve as the source
                # for searching for responses from the first directory, and something
                # similar is needed for the rest of the directories
            }
            @msgLines = ();
        }
    }

    my @firstdir = @{ $logfiles[0] };
    my $path     = $logdirs[0];
    foreach my $file (@firstdir) {
        my $curpath = "$path/$file";
        print "In foreach trying to open $curpath\n";
        doFile($curpath);
    }
    The log files are huge, so zipping them into a single file is not possible (out of disk space). Any Perl modules that can help me with this task?
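
    A possible shape for the second pass, sketched under the assumption that @IDs has been filled as above and that @logdirs holds all three directory paths (both names are taken from the snippet); the tag regexes follow the sample pattern, and responses.txt is a placeholder output name:

    use strict;
    use warnings;

    our @IDs;        # from the first pass above
    our @logdirs;    # all three log directories

    # Second pass: scan every directory and print the response from any block
    # whose id was collected in the first pass.
    my %wanted = map { $_ => 1 } @IDs;    # fast lookup of the ids to report

    open my $out, '>', 'responses.txt' or die "Cannot open responses.txt: $!\n";

    for my $dir (@logdirs) {
        opendir my $dh, $dir or die "Cannot read $dir: $!\n";
        for my $file (grep { -f "$dir/$_" } readdir $dh) {
            open my $fh, '<', "$dir/$file" or die "Cannot open $dir/$file: $!\n";
            my @block;
            while (my $line = <$fh>) {
                push @block, $line;
                next unless $line =~ m{</block>\s*$};    # one block is complete
                my ($id)   = map { m{<id>\s*(\S+)\s*</id>}             ? $1 : () } @block;
                my ($resp) = map { m{<response>\s*(.*?)\s*</response>} ? $1 : () } @block;
                print {$out} "$id\t$resp\n"
                    if defined $id && $wanted{$id} && defined $resp;
                @block = ();
            }
            close $fh;
        }
        closedir $dh;
    }
    close $out;

    Because each block is dropped as soon as it has been checked, only one block is ever held in memory at a time, which keeps the huge files manageable.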