in reply to pulling by regex

So, you are reading your weblogs over and over, once for each request? That's not very efficient. Why not dump the log data into a database, and query the database? You could have the database do most of the work, including finding the top 10.

Abigail

Re: Re: pulling by regex
by mkent (Acolyte) on Dec 13, 2002 at 00:02 UTC
    Hey, guys, thanks!!! This is a wonderful resource, and I incorporated some suggestions into the revised script below. I still have some questions, though!

    BrowserUk, I decided against using Date::Manip even though I really like that module. That's because the module's documentation warns that it's slower than other time modules, and this script will be used most often when the web server is overloaded with requests; thus, speed is essential.

    Abigail-II, a database would be nice, but the server is producing regular logs, so that's what I have to use.

    In the following script, here are my questions:

    1) Using strict produces errors that I don't have a global module loaded; what module is that?

    2) The simulated $month switch statement doesn't work as expected; instead of values 0 through 11, it gives everything a value of 1. Getting it changed to a number makes timelocal accurate.

    3) At the end, I pack all the referrers into an array; what I need to do is count each referrer as a unique URL, so that www.you.com is counted x times and www.me.com is counted y times. Then I can tell the top referrer in the time period stipulated by the web page (which just has hours and minutes to enter). That will let me create output like
    www.you.com 22
    www.me.com 19
    etc
    How can I count an unknown value and produce this output? And is an array the best way to do it?
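For question 3, the usual Perl idiom for tallying values that are not known in advance is a hash keyed on the value, rather than an array. The following is only a minimal sketch: the `@refers` contents are made-up sample data standing in for the referrers collected from the log, and `%count` and `@ranked` are illustrative names.

```perl
use strict;
use warnings;

# Hypothetical sample; in the real script these would be the referrers
# pulled from log lines that fall inside the requested time window.
my @refers = qw(
    www.you.com www.me.com www.you.com
    www.you.com www.me.com www.other.com
);

# Count each distinct referrer: the hash key is the URL, the value its tally.
my %count;
$count{$_}++ for @refers;

# Sort by descending count, breaking ties alphabetically for stable output.
my @ranked = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

# Print at most the top 10.
my $last = $#ranked < 9 ? $#ranked : 9;
for my $url (@ranked[0 .. $last]) {
    print "$url $count{$url}\n";
}
```

In the script itself, `push @refers, $referrer` could become `$count{$referrer}++`, so the array is not needed at all.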

    Any and all ideas welcome, and thanks in advance. I really appreciate the help!

    Here's the script, followed by some raw log data:

    #!/usr/local/bin/perl
    #use strict;
    use CGI qw(:standard);
    use CGI::Carp qw(fatalsToBrowser carpout);
    use Time::Local;

    # Grab information returned by web page
    $hour = param ("hour");
    $minute = param ("minute");

    # Allow perl to write to browser window
    print "Content-type: text/html\n\n";

    # Current time in seconds
    $now = time;

    # Convert submitted time to seconds
    $compare_time = ($hour * 3600) + ($minute * 60);

    # Times extracted by logs must be >= to $target
    $target = $now - $compare_time;

    open LOGFILE, "datafile.html" || die "Can't open file";
    @log_data =<LOGFILE>;

    # Grab useful information from each line of the web log
    foreach $log_line(@log_data) {

        # Grab date/time and referer
        ($date_string, $referrer) = ($log_line =~ /\[([^\]]+)\] "[^"]+"[^"]+"([^"]+)"/);

        # Replace / and : with spaces
        $date_string =~ s!/! !g;
        $date_string =~ s!:! !g;

        # Dump junk at end of line
        $date_string =~ s! -[0-9]+!!;

        # Split date/time into useful information
        ($day, $month, $year, $hhour, $min, $sec) = split(' ', $date_string);

        # Convert month from text to number
        if ($month == 'Jan') {$month = 0}
        elsif ($month == 'Feb') {$month = 1}
        elsif ($month == 'Mar') {$month = 2}
        elsif ($month == 'Apr') {$month = 3}
        elsif ($month == 'May') {$month = 4}
        elsif ($month == 'Jun') {$month = 5}
        elsif ($month == 'Jul') {$month = 6}
        elsif ($month == 'Aug') {$month = 7}
        elsif ($month == 'Sep') {$month = 8}
        elsif ($month == 'Oct') {$month = 9}
        elsif ($month == 'Nov') {$month = 10}
        else {$month = 11}

        # Calculate time on the log line in seconds
        $log_time = timelocal($sec,$min,$hhour,$day,$month,$year);

        if ($log_time >= $target) {
            push @refers, $referrer;
        }
    }
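On questions 1 and 2, a few things in Perl are worth knowing. Under `use strict`, each variable has to be declared (typically with `my`); the "Global symbol ... requires explicit package name" errors usually mean a declaration is missing, not that a module is. The `==` operator compares numbers, and strings such as 'Jan' and 'Dec' all evaluate to 0 numerically, so string comparisons need `eq`. Finally, `||` binds more tightly than the argument list of `open`, so `open ... or die` is the safer form. The sketch below shows one way the parsing could look with those points applied, using a hash lookup in place of the if/elsif chain. `parse_line` and `%month_num` are hypothetical names, and the regex assumes the combined log format shown in the sample data.

```perl
use strict;
use warnings;
use Time::Local;

# Month abbreviation -> 0-based month number, as timelocal expects.
my %month_num = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# parse_line is a hypothetical helper: it returns (epoch seconds, referrer)
# for one combined-format log line, or an empty list if the line does not match.
sub parse_line {
    my ($line) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $referrer) = $line =~
        m{\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) [^\]]*\] "[^"]*" \S+ \S+ "([^"]*)"}
        or return;
    return unless exists $month_num{$mon};

    # The zone offset in the log is ignored here; the stamp is treated as
    # local time, as in the original script.
    my $epoch = timelocal($sec, $min, $hour, $day, $month_num{$mon}, $year);
    return ($epoch, $referrer);
}

my $sample = '216.45.43.42 - - [12/Dec/2002:18:39:15 -0500] '
           . '"GET /news/opinions/varvel.gif HTTP/1.1" 302 313 '
           . '"http://www.freerepublic.com/forum/a3a95ca3c24a0.htm" "Mozilla/4.0"';

my ($epoch, $referrer) = parse_line($sample);
print scalar(localtime $epoch), " $referrer\n";
```

Capturing the date fields directly in the regex also removes the need for the three substitutions and the `split`.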

    Some data:

    216.45.43.42 - - [12/Dec/2002:18:39:15 -0500] "GET /news/opinions/varvel.gif HTTP/1.1" 302 313 "http://www.freerepublic.com/forum/a3a95ca3c24a0.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/header_aod2_15.gif HTTP/1.1" 200 4162 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/storysearch2.gif HTTP/1.1" 200 142 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /users/ads/misc/remax_searchad3.gif HTTP/1.1" 200 2335 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sports_03_aod.gif HTTP/1.1" 200 3195 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/email.gif HTTP/1.1" 200 138 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/print.gif HTTP/1.1" 200 139 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sidelinksend2.gif HTTP/1.1" 200 1009 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/pics2/image-007735-7410.jpg HTTP/1.1" 200 18319 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/advertisement_250strip.gif HTTP/1.1" 200 238 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
    12.222.75.65 - - [12/Dec/2002:18:39:17 -0500] "GET /users/ads/story/macselect/macselect_250_Oct.gif HTTP/1.1" 200 10436 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"


Re: Re: pulling by regex
by mkent (Acolyte) on Dec 15, 2002 at 21:01 UTC
    On reflection, that's a good idea, using MySQL. But wouldn't it waste time overwriting the same database each time the script is called, since there would be no point in keeping the old data? As I envision it, the script would translate each log line into a date string plus the referrer and send both to MySQL as two fields. It would then process the input from the web page and use that information to pull from the database. What do you think?
      I think dumping the log into a database every time the script runs is pretty stupid, and it defeats the benefits. What's in the database is in the database, and doesn't have to be inserted again. Just dump the new logs into the database on a regular basis.

      Abigail
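One way to pick up only the new log entries on each run is to remember the byte offset where the previous run stopped and resume from there. The sketch below uses core Perl only and leaves the database insert to the caller. `read_new_lines` is a hypothetical helper, and the temporary file merely stands in for the real log; nothing here comes from the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# read_new_lines is a hypothetical helper: it returns a reference to the lines
# appended to $path since byte position $offset, plus the position to remember
# for the next run.
sub read_new_lines {
    my ($path, $offset) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";

    # If the log was rotated or truncated, start again from the beginning.
    my $size = -s $fh;
    $offset = 0 if $offset > $size;

    seek($fh, $offset, 0) or die "Can't seek in $path: $!";
    my @lines = <$fh>;
    my $new_offset = tell($fh);
    close($fh);
    return (\@lines, $new_offset);
}

# Demonstration with a temporary file standing in for the web log.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "first\nsecond\n";
close($tmp_fh);

my ($batch1, $pos) = read_new_lines($tmp_path, 0);

open(my $append, '>>', $tmp_path) or die "Can't append to $tmp_path: $!";
print {$append} "third\n";
close($append);

my ($batch2, $pos2) = read_new_lines($tmp_path, $pos);
print scalar(@$batch1), " line(s) on the first run, ",
      scalar(@$batch2), " new line(s) on the second\n";
```

The saved offset would need to live somewhere persistent between runs, for example in a small file or in the database itself.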