Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Pulling by regex II

by mkent (Acolyte)
on Dec 13, 2002 at 00:10 UTC ( [id://219471]=perlquestion: print w/replies, xml ) Need Help??

mkent has asked for the wisdom of the Perl Monks concerning the following question:

Hey, guys, thanks!!! This is a wonderful resource, and I incorporated some suggestions into the revised script below. I still have some questions, though!

BrowserUk, I decided against using Date:Manip even though I really like that module. That's because the module instructions warn that it's slower than other time modules and this script will be used most often when the web server is overloaded with requests; thus, speed is essential.

Abigail-II, a database would be nice, but the server is producing regular logs, so that's what I have to use.

In the following script, here are my questions:

1) Using strict produces errors that I don't have a global module loaded; what module is that?

2) The simulated $month switch statement doesn't work as expected; instead of values 0 through 11, it gives everything a value of 1. Getting it changed to a number makes timelocal accurate.

3. At the end, I pack all the referrers into an array; what I need to do is count each referrer as an unique URL, so that www.you.com is counted x times and www.me.com is counted y times so I can then tell the top referrer in the time period stipulated by the web page (which just has hours and minutes to enter). That will let me create output like
www.you.com 22
www.me.com 19
etc
How can I count an unknown value and produce this output
And is an array the best way to do it?

Any and all ideas welcome, and thanks in advance. I really appreciate the help!

Here's the script, followed by some raw log data:

#!/usr/local/bin/perl #use strict; use CGI qw(:standard); use CGI::Carp qw(fatalsToBrowser carpout); use Time::Local; # Grab information returned by web page $hour = param ("hour"); $minute = param ("minute"); # Allow perl to write to browser window print "Content-type: text/html\n\n"; # Current time in seconds $now = time; # Convert submitted time to seconds $compare_time = ($hour * 3600) + ($minute * 60); # Times extracted by logs must be >= to $target $target = $now - $compare_time; open LOGFILE, "datafile.html" || die "Can't open file"; @log_data =<LOGFILE>; # Grab useful information from each line of the web log foreach $log_line(@log_data) { # Grab date/time and referer ($date_string, $referrer) = ($log_line =~ /\([^\]+)\] "^"+"^"+"(^"+ +)"/); # Replace / and : with spaces $date_string =~ s!/! !g; $date_string =~ s!:! !g; # Dump junk at end of line $date_string =~ s! -0-9+!!; # Split date/time into useful information ($day, $month, $year, $hhour, $min, $sec) = split(' ', $date_string +); # Convert month from text to number if ($month == 'Jan') {$month = 0} elsif ($month == 'Feb') {$month = 1} elsif ($month == 'Mar') {$month = 2} elsif ($month == 'Apr') {$month = 3} elsif ($month == 'May') {$month = 4} elsif ($month == 'Jun') {$month = 5} elsif ($month == 'Jul') {$month = 6} elsif ($month == 'Aug') {$month = 7} elsif ($month == 'Sep') {$month = 8} elsif ($month == 'Oct') {$month = 9} elsif ($month == 'Nov') {$month = 10} else {$month = 11} # Calculate time on the log line in seconds $log_time = timelocal($sec,$min,$hhour,$day,$month,$year); if ($log_time >= $target) { push @refers, $referrer; } } <code> Some data: <pre> 216.45.43.42 - - 12/Dec/2002:18:39:15 -0500 "GET /news/opinions/varvel +.gif HTTP/1.1" 302 313 "http://www.freerepublic.com/forum/a3a95ca3c24 +a0.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR +1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:15 -0500 "GET /images/header_aod2_1 +5.gif HTTP/1.1" 200 4162 "http://www.indystar.com/print/articles/1/00 +7735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; W +in 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:15 -0500 "GET /images/storysearch2. +gif HTTP/1.1" 200 142 "http://www.indystar.com/print/articles/1/00773 +5-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win +9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:15 -0500 "GET /users/ads/misc/remax +_searchad3.gif HTTP/1.1" 200 2335 "http://www.indystar.com/print/arti +cles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Wind +ows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705 +)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/sports_03_aod +.gif HTTP/1.1" 200 3195 "http://www.indystar.com/print/articles/1/007 +735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Wi +n 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/email.gif HTT +P/1.1" 200 138 "http://www.indystar.com/print/articles/1/007735-7671- +036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90 +; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/print.gif HTT +P/1.1" 200 139 "http://www.indystar.com/print/articles/1/007735-7671- +036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90 +; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/sidelinksend2 +.gif HTTP/1.1" 200 1009 "http://www.indystar.com/print/articles/1/007 +735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Wi +n 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/pics2/image-0 +07735-7410.jpg HTTP/1.1" 200 18319 "http://www.indystar.com/print/art +icles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Win +dows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.370 +5)" 12.222.75.65 - - 12/Dec/2002:18:39:16 -0500 "GET /images/advertisement +_250strip.gif HTTP/1.1" 200 238 "http://www.indystar.com/print/articl +es/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Window +s 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)" 12.222.75.65 - - 12/Dec/2002:18:39:17 -0500 "GET /users/ads/story/macs +elect/macselect_250_Oct.gif HTTP/1.1" 200 10436 "http://www.indystar. +com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; +MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NE +T CLR 1.0.3705)"

edited: Fri Dec 13 16:15:35 2002 by jeffa - added <readmore>

Replies are listed 'Best First'.
Re: Pulling by regex II
by tadman (Prior) on Dec 13, 2002 at 00:56 UTC
    Just a few tricks. Don't use chained if statements when a hash will do:
    my %month = ( Jan => 0, Feb => 1, ... ); # ... $month = $month{$month}; # Could stand to use better names
    In many cases it's often better to make a big regex for the entire line. A fully-expanded line-matching regex can break out all the variables you need without any leftover material that needs to be stripped off with substitutions.

    Alternatively, you could use something like Apache::ParseLog.

    Also, since you've left out the brackets on your open call, it doesn't actually error out when you expect it to. The correct way would be:
    open(LOGFILE, "datafile.html") || die "Can't open file";
    Further, you can actually iterate over the log file one line at a time instead of reading it all in:
    foreach my $log_line (<LOGFILE>) { # ... }
    By the way, that commented out #use strict; is scary. The reason you're getting errors is because you're not properly declaring your variables with my. For example:
    my $hour = param ("hour"); my $minute = param ("minute");
      Alternately, you could build a month => month_num hash with a one-line map statement ala:
      my %months = map { (qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec +/)[$_] => $_}(0..11);
      I like doing stuff like this because it saves you the possible typo(s) if you userstand how map works. I'm always looking for more efficient ways to do it, however. Can anyone think of any?

          --jb

        Why use an array and map and a hash. A string and index will do it.

        my $month_num = index( 'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov De +c', $month_str)/4;

        Examine what is said, not who speaks.

      Thanks, I totally missed the error problem, thanks for pointing it out. Would going through the log one line at a time be more efficient?
        If you have a 2GB log file, you can imagine that reading the whole thing into memory isn't going to be terribly efficient. Processing it one line at a time is virtually a must.

        Unless you need to read in the whole file because it's not terribly large and you will be refering to it several times, it is almost always better to pick through it one line at a time.
Re: Pulling by regex II
by BrowserUk (Patriarch) on Dec 13, 2002 at 00:57 UTC

    BrowserUk, I decided against using Date:Manip even though I really like that module. That's because the module instructions warn that it's slower than other time modules and this script will be used most often when the web server is overloaded with requests; thus, speed is essential.

    Given you are doing a filesystem read between every call to Date::Manip, I seriously doubt it would be a bottleneck. In any case it would be running faster than your broken code:).

    If you really find that Date::Manip is too slow for your application, then you could move to using one of the (several dozen) Date::* modules on CPAN rather than attempting to roll your own. Several of these are mentioned in the Date::Manip pod: Date::Parser, Date::Calc. The later of which is an interface to a C library I believe and its reputedly very fast. The main reason I didn't suggest these is that you need to compile them which is sometimes a problem.

    A quick look at your code and its glaringly obvious what the problem is. And if you hadn't commented out the use strict and omitted the -w/use warnings it would be glaringly obvious to you too, as the compiler would be screaming the cause at you over and over. 11 times I think.

    Come back, if you need to, once you have made your code compile cleanly with -w and use strict and you might get some further help.


    Examine what is said, not who speaks.

      I'm really sorry if I offended you. That wasn't my intention at all and I really appreciate the help. I was reluctant to install Date::Manip, being new to this stuff, but I've done it and now I'm trying to figure out where to specify the log files on my server to test your code. Again, thanks and I didn't mean to offend you.
Re: Pulling by regex II
by pg (Canon) on Dec 13, 2002 at 00:18 UTC
    • For the switch, use eq instead of ==. == operator is for comparing numbers, and eq is for comparing strings. Also better use a hash, instead of this switch, that makes your code more neat. Key for the hash would be 'Jan' etc, and value would be 0 etc.
    • For question 3, instead of array, use hash. The key for the hash is referer, and value is the counter.
      Thanks, eq works. Using == was a real newbie miss. I couldn't quite figure out how to use the hash.
Re: Pulling by regex II
by PhiRatE (Monk) on Dec 13, 2002 at 11:11 UTC
    Right. Lets get down to it. You (and I say you in a friendly, you're-not-the-only-one-everyone-does-this way) have a problem of bad priorities, so, lets list the priorities you have stated so far:

    • Tight deadline
    • Performance
    • Working code

    Now, a few of these don't match. Even more don't add up when you add priorities that you *should* have, notably "secure".

    So, lets detail some of the issues we have with the code thus far:

    • Reading in entire log file (into ram no less) every time the script is run
    • No use of taint mode
    • No use of warnings
    • Bad choice of time comparison algorithm
    • Iffy regexen

    Under these circumstances, without access to help, you should be getting your priorities in order. While it depends on your particular position, the following is a good start:

    1. Working code
    2. Working, secure code
    3. Working, secure code by deadline
    4. Working, secure, maintainable code by deadline
    5. Working, secure, maintainable code that performs well, by deadline

    What does this mean? Well, the first thing it means is that if a well known, already-debugged module is around that can do any part of the problem you're trying to solve, use it. Performance is waaay down on the list, just use it. If it really turns out too slow in use, *then* you worry about it.

    The second thing it means is a quick investigation of language features that can help you out with debugging and security. Taint mode, Strict and Warnings should all be on, and you shouldn't move an inch until your code is working with all of them.

    Now, if you had done this, most of the code above, that isn't working, wouldn't even exist. You'd have something a fraction of the size, that worked fine. Maybe it'd be a bit slow, maybe not, but if it was, you'd be posting to perlmonks saying "Hey, here's my working code, how can I make it go a bit faster", and then people like Abigail-II would have been able to tell you how exactly a database would aid your cause, even when the logs are in text file format. But we're not there yet.

    #!/usr/bin/perl -t use strict; use warnings; use Date::Manip; use CGI qw/:standard/; # Make sure neither we, nor any of our submodules compromise security # by calling unpathed programs. $ENV{PATH} = "/bin:/usr/bin"; $ENV{IFS}=""; # Use CGI to print our headers print header; my %referers = (); # Retrieve and security-check parameters my $hour = param('hour'); my $minute = param('minute'); if ($hour !~ /^\d\d?$/) { die('Invalid hour'); } if ($minute !~ /^\d\d?$/) { die('Invalid minute'); } # Get date object for our check point my $check_date = ParseDate("${hour}hours ${minute}minutes ago"); # File handling, one line at a time open(FH,"file") || die('Could not open log file'); while (my $line = <FH>) { next if ($line !~ /^\S+ \S \S (\S+) \S+ "[^"]+" \d+ \d+ "([^"]+)"/ +); my $line_date = ParseDate($1); # Check to see if the line date is in the range we're after next unless Date_Cmp($line_date, $check_date)>0; # If the referer is new, we set to 1 entry, otherwise increment (i +ncrementing undef doesn't work well) if (!$referers{$2}) { $referers{$2}=1; } else { $referers{$2}++; } } close(FH); my $row = 0; # Sort our referers by the number of hits for (sort {$referers{$b} <=> $referers{$a}} keys %referers) { # break out after the tenth one last if $row++==10; print "$_: ".$referers{$_}."\n"; }

    This is by no means perfect, I didn't have enough log examples to work from to be sure the regex will work well, but it contains all the properties of a good first effort. All the necessary warnings and taints turned on, the relevant modules used (there is, I think, an apache log parsing module too, but I wasn't sure you were using apache), all user input is verified and no more code than strictly necessary is used.

    The end result...no weird month name/number conversions, about 30 seconds of debugging, and a much better product. I highly recommend, prior to your next project, sitting down and listing your priorities as above, it will help focus you on what really matters first, and what can be tweaked second.

      I note that my regex is not entirely correct, my cut&paste of the logs that I used for testing did not contain the square brackets, not sure how that happened, regex should be:

      /^\S+ \S \S \[(\S+) \S+\] "[^"]+" \d+ \d+ "([^"]+)"/
      I believe
      Thanks. I've installed Date::Manip and tried to run your code as a test, but I keep getting "Premature end of script headers" errors. I've checked the path on my server but can't figure out why it's generating these errors. Can you tell me why? Thanks.
        As a hunch, you may need to try switching off taint (the -t switch). Date::Manip may not be handling taint correctly, other than that, no reason that I can think of.
        It's a problem with the output of the script; which is more than likely quite easy to fix. You print your headers using a call to CGI.pm (see the print header; line?), followed by a blank line, and then the page data (<html> etc.).

        You get the "premature end of script headers error" if you didn't output the blank line. That means your problem may be something to do with your print header; line.

        Try reading the CGI spec, the CGI.pm perldoc , and try again.

Re: Pulling by regex II
by rdfield (Priest) on Dec 13, 2002 at 09:11 UTC
      Oh, my. I didn't mean to offend Abigail-II or anyone else, for that matter. I admit to being a total newcomer to both Perl and this list and I really appreciate the help being offered. I apologize deeply if I've been offensive!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://219471]
Approved by pg
Front-paged by pg
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2024-04-19 17:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found