Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Log munging optimization tips?

by mkent (Acolyte)
on Dec 19, 2002 at 22:57 UTC ( [id://221279]=perlquestion: print w/replies, xml ) Need Help??

mkent has asked for the wisdom of the Perl Monks concerning the following question:

This is an update to see if anyone can suggest ways to speed up this script, which tends to be rather slow:

use strict; use warnings; use Date::Manip; use CGI qw/:standard/; # Make sure security is not compromised by calling unpathed programs. $ENV{PATH} = "/bin:/usr/bin:/usr/local/bin:"; $ENV{IFS}=""; # Use CGI to print the header print header; # Make variables local only my %referers = (); my $row = 0; my $counter = 0; # Retrieve and security-check parameters my $site = param('site'); my $hour = param('hour'); my $minute = param('minute'); if ($hour !~ /^\d\d?$/) { die('Invalid hour'); } if ($minute !~ /^\d\d?$/) { die('Invalid minute'); } # Get date object for the checkpoint my $check_date = ParseDate("${hour}hours ${minute}minutes ago"); # Select the server log - current 12/19/02 my $data = ''; if ($site eq 'star') {$data = 'indystar/access_log'} elsif ($site eq 'topics') {$data = 'topics/access_log'} else {$data = 'noblesville/access_log'} # Create headline for web page print "<h1>Referrers in the past $hour hours and $minute minutes</h1>" +; # File handling, one line at a time; if can't open, say why open(FH,"$data") || die('Could not open $data: $1'); while (defined(my $line = <FH>) ) { next if ($line !~ /^\S+ \S \S \[(\S+) \S+\] "[^"]+" \d+ \d+ "([^"] ++)"/); my $line_date = ParseDate($1); # Check to see if the line date is in the range we're after next unless Date_Cmp($line_date, $check_date)>0; # If the referer is new, set to 1 entry, otherwise increment if (not exists $referers{$2}) { $referers{$2}=1; } else { $referers{$2}++; } } close(FH); # Sort and print for (sort {$referers{$b} <=> $referers{$a}} keys %referers) { if ($counter <= 10) { open (FILE, ">storage1.txt") || die('Could not open $storage1.t +xt: $1'); print "$_ - $referers{$_}<p>"; print FILE "$_ - $referers{$_}<p>"; ++$counter; } elsif ($counter > 10 && $counter <= 20) { if ($counter == 11) { print "<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;< +a href=\"storage2.txt\"><font color=\"FF0000\">Next</font></a><br>"; print FILE "<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n +bsp;<a href=\"storage2.txt\"><font color=\"FF0000\">Next</font></a><b +r>"; open (FILE2, ">storage2.txt") || die('Could not open $stora +ge2.txt: $1'); print FILE2 "$_ - $referers{$_}<p>"; ++$counter; } } elsif ($counter > 20 && $counter <= 30) { if ($counter == 21) { print "<p>a href=\"storage2.txt\"><font color=\"FF0000\">Pr +evious</font></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a h +ref=\"storage3.txt\"><font color=\"FF0000\">Next</font></a><br>"; open (FILE3, ">storage3.txt") || die('Could not open $stora +ge3.txt: $1'); print FILE3 "$_ - $referers{$_}<p>"; ++$counter; } } elsif ($counter > 30 && $counter <= 40) { if ($counter == 31) { print "<p>a href=\"storage3.txt\"><font color=\"FF0000\">Pr +evious</font></a><br>"; open (FILE4, ">storage4.txt") || die('Could not open $stora +ge4.txt: $1'); print FILE4 "$_ - $referers{$_}<p>"; } } } close FILE; close FILE2; close FILE3; close FILE4;
Edit davorg - changed title

Replies are listed 'Best First'.
Re: Speed
by Elian (Parson) on Dec 19, 2002 at 23:07 UTC
    Well, the print logic seems somewhat screwy, but if it does what you want, that's fine.

    Have you considered the possibility that you're just chewing through a lot of data and that takes time? If you do this regularly you might consider having the log entries automatically thrown into an SQL database and have the display program here just extract and display things? You'll likely get better performance in the display if you do some of the processing in a separate step...

Re: Speed
by MarkM (Curate) on Dec 20, 2002 at 05:24 UTC

    There are two reasons why your script is slow:

    1. You are using Date::Manip to parse the date strings of every record. Date::Manip is very slow.
    2. As mentioned by somebody else - you are scanning all log entries at every invocation. Putting this together with the fact that every log entry is parsed using Date::Manip, and you have the perfect recipe for turtle-sickness. Instead of rescanning and skipping log entries before a certain cut-off time, maintain an offset into the file of the 'last log entry processed' and use the seek() system call to resume from where you left off. If possible, avoid Date::Manip or any time processing completely as it will often result in unnecessary system calls, or else CPU intensive processing.

    Also mentioned by somebody else: monks love to help other monks. However, monks do not like to have their freely offered gifts abused. You do need to do your own research once in a while, and you do need to think of monks as more than a resource that can be exploited to allow you to get your job done without thinking for yourself. This comment is not made with disrespect towards you. It is made as a request that the valuable resource known as 'the monks' is respected.

Re: Speed
by UnderMine (Friar) on Dec 20, 2002 at 00:34 UTC
    This appears to be more a miss-understanding the problem than a very bad script.

    As far as I can see you are not rotating your logs therefor your script is scanning EVERY HIT your website has EVER had before it can analyse the last ten minutes.

    Rotating logs is vital if you wish to do any kind of serious log analysis. My current record is having to rotate logs every 6 hours on a single website as files grow at over 1Mb/min. Multiple vhosts should always log to different files which you appear to be doing otherwise analysis takes a lot longer.

    I would personally ask why you feel the need to write your own web based analysis system as Analog would probably do the job. If you do feel the need rotating logs daily can be a boon as you then know exactly which logs are needed for a particular set of days from the file name.

    One method I have used to generate such realtime reporting is to use a daemon to run short periodic reports and store the stats in a database. Your script then processes the summary stats and not the raw log files.

    Done correctly you can get very impressive results with these methods. I have had a system happily doing analysis on over 100M hits in sub 5 seconds but you need to really understand every possible shortcut.

    However on the script side.. when counter=31 you have no counter increments happening. so you have the file be opened and reopened.

    Hope it help
    UnderMine

Re: Slow Process
by tadman (Prior) on Dec 19, 2002 at 23:41 UTC
    Unfortunately, I have to agree with BrowserUk's earlier remarks about this recent rash of postings that keep pecking away at this problem, one minor thing at a time (count sort & output II, count sort & output, Pulling by regex II, pulling by regex). Asking for help is one thing, and is certainly encouraged, but there is an expectation that one would at least try to help themselves as well.

    If you are having this much trouble with Perl, you should either hire a programmer or find some training materials. Posting every minor improvement to your script is not very productive.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://221279]
Approved by Elian
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others taking refuge in the Monastery: (4)
As of 2024-04-18 21:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found