parse gzipped weblogs

by vili (Monk)
on Jul 22, 2003 at 01:14 UTC ( [id://276553] )

vili has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl adepts. Mine is a newbie problem. I spent quite a few hours looking for a solution, to no avail.
Let me just say that it is a privilege to be in the company of merlyn, and I think that this site is great.
I need to go through seven directories that contain tons of gz logs, and I haven't been able to figure out a way to have my program process the gzipped files. Currently I give a timestamp as an argument, and all logs from the respective directories with a matching timestamp in the file name get looked at; those are not compressed. Eventually I'll have to take a starting period and an ending period and make it really easy to use for the marketing folk. This is my relevant code:


@hostnames = qw(baloo beast belle chip cogsworth lefou potts);
my $timestamp = $ARGV[0];
for ($c = 0; $c < @hostnames; $c++) {
    my $logpath = "/archive/$hostnames[$c]/www/logs/access_log.$timestamp";
    #print "$logpath\n";
    open(LOGS, "$logpath") || die "can't open $logpath\n";
    while (<LOGS>) {
        # regex: get valuable data here
    }
    close(LOGS);
}
my questions are several:

1. How do I make this process the gz files? (I tried using * in place of the $timestamp, but that wasn't working; I wonder if I'll have to decompress and then parse, which will take ages.)

2. What would be the most elegant way to do the above (1)?

3. In your opinion, what is the simplest, easiest-to-use time/date parser, and what would be a good way to implement it?

Thanks to everyone who took the time to read this long-winded post. I'd be thrilled to get your take on this. Thanks for the help.


~vili
roaming the perl labyrinth

Replies are listed 'Best First'.
Re: parse gzipped weblogs
by Kageneko (Scribe) on Jul 22, 2003 at 01:42 UTC
    Coincidentally, I am in the middle of working with gzipped logs myself. Perl, by itself, does not "understand" gzip files. You must make it understand by using Compress::Zlib or IO::Zlib. I prefer the latter, as it emulates the IO::* interface and makes your code a lot easier to work with. With IO::Zlib, you can do something like this:
    use strict;
    use warnings;
    use IO::Zlib;    # also uses Compress::Zlib

    my @hostnames = qw(baloo beast belle chip cogsworth lefou potts);
    my $timestamp = $ARGV[0];

    foreach my $host ( @hostnames ) {
        my $logpath = "/archive/$host/www/logs/access_log.$timestamp";
        my $fh = IO::Zlib->new($logpath, "r");
        die "Cannot open $logpath: $!" if !$fh;
        # Probably good to check Compress::Zlib::gzerrno, too.
        while ( <$fh> ) {
            # do text processing here
        }
        $fh->close;
    }
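    If the exact timestamps aren't known up front, a minimal sketch along the same lines (an illustration, not code from the reply) is to glob for every gzipped log under each host's directory; the access_log.<timestamp>.gz naming is an assumption based on the question:

    use strict;
    use warnings;
    use IO::Zlib;

    my @hostnames = qw(baloo beast belle chip cogsworth lefou potts);
    foreach my $host ( @hostnames ) {
        # Assumed naming: compressed logs end in .gz next to the plain ones.
        foreach my $logpath ( glob "/archive/$host/www/logs/access_log.*.gz" ) {
            my $fh = IO::Zlib->new($logpath, "rb");
            if ( !$fh ) {
                warn "Cannot open $logpath\n";
                next;
            }
            while ( my $line = <$fh> ) {
                # extract valuable data from $line here
            }
            $fh->close;
        }
    }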
    As for your third question, I'm not quite sure what you mean. I tend to use Date::Manip for all of my happy-fun date processing.
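    Since the question also mentions eventually taking a starting period and an ending period, here is a minimal sketch (an addition, not from the reply) of Date::Manip's functional interface expanding a date range into per-day timestamps; the YYYYMMDD log suffix and the command-line arguments are assumptions:

    use strict;
    use warnings;
    use Date::Manip;    # exports ParseDate, Date_Cmp, DateCalc, UnixDate

    die "Usage: $0 startdate enddate\n" unless @ARGV == 2;
    my ($start, $end) = @ARGV;               # e.g. "2003-07-01" "2003-07-22"

    my $from = ParseDate($start) or die "Bad start date: $start\n";
    my $to   = ParseDate($end)   or die "Bad end date: $end\n";

    # Walk the range one day at a time and collect YYYYMMDD strings
    # suitable for building the access_log.<timestamp> names.
    my @timestamps;
    for ( my $d = $from; Date_Cmp($d, $to) <= 0; $d = DateCalc($d, "+ 1 day") ) {
        push @timestamps, UnixDate($d, "%Y%m%d");
    }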
Re: parse gzipped weblogs
by zengargoyle (Deacon) on Jul 22, 2003 at 02:05 UTC

    You can put your code in <code></code> tags to keep it nice and easy to read. You can also just decompress on the fly if you need to, by piping the output of gzip into your script. You should check the timestamp to make sure somebody doesn't try '20030201;rm -rf /;' or something else nasty (a sketch of such a check follows the code below).

    @hostnames = qw(baloo beast belle chip cogsworth lefou potts);
    my $timestamp = $ARGV[0];
    for my $hostname (@hostnames) {
        my $logpath = "/archive/$hostname/www/logs/access_log.$timestamp";
        #print "$logpath\n";
        ## change this
        open(LOGS, "/path/to/gzip -d -c $logpath.gz |")
            or die "can't open $logpath\n";
        while (<LOGS>) {
            # regex: get valuable data here
        }
        close(LOGS);
    }
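    A minimal sketch of the timestamp check mentioned above, assuming a plain YYYYMMDD suffix (the exact log-name format isn't given in the thread):

    # Validate $ARGV[0] before it is interpolated into the shell command
    # passed to open(); reject anything that isn't a plain run of digits.
    my $timestamp = $ARGV[0];
    die "Usage: $0 YYYYMMDD\n"
        unless defined $timestamp && $timestamp =~ /^\d{8}$/;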
Re: parse gzipped weblogs
by Dog and Pony (Priest) on Jul 22, 2003 at 03:05 UTC
    You have been given answers to your questions as stated; I'm just curious what it is that you actually want/need to do. It could be that you are trying to extract some information that is special to your company or some such, but if you are looking to extract normal website statistics, then maybe you shouldn't roll your own. Instead, look at programs such as Analog, which is highly configurable and free.

    Just thought I'd mention it, in case that might be a simpler and faster solution. Parsing and cross-referencing weblogs is no fun, and the logs tend to get quite big pretty fast, so it is also labour-intensive for the program. Of course it is possible (and, as such, easier than in most languages) to do in Perl, but it might not be the best tool for the job. :)


    You have moved into a dark place.
    It is pitch black. You are likely to be eaten by a grue.
      Thank you to all who replied to my post. Your time and input are appreciated.
      Noted on the <code> tags, zengargoyle.
      Dog and Pony, I've already looked at Analog; despite its amazing thoroughness, I can't use it for the specific information that is required of me.
      Thank you, Kageneko.
      I'm sure I'll have more questions eventually. ~vili
