PerlMonks  

pulling by regex

by mkent (Acolyte)
on Dec 10, 2002 at 23:50 UTC ( [id://218961]=perlquestion )

mkent has asked for the wisdom of the Perl Monks concerning the following question:

I'm a newbie with a deadline, so any help is appreciated. I need to write a program with a web interface where a period of time can be specified (last 2 hours, last 24 hours, last 2 hours and 15 minutes). The program should then read the web log, fetch all entries that fall within that period, find the referrers, and add them up to display the top 10 referrers, in order.

As a first step, I'm trying to pull out the time and HTTP referrer from web log data, but it's not going well. The only way I can see to do it is to strip out the unwanted parts of the log line and then use the timelocal function to convert the log time to epoch seconds, so it can be compared with whatever math is done on the current time. Here's what I have so far as a test:

    #!/usr/local/bin/perl
    use CGI qw(:standard);
    use CGI::Carp qw(fatalsToBrowser carpout);
    use Time::Local;
    print "Content-type: text/html\n\n";
    #$time = timelocal($sec,$min,$hour,$mday,$mon,$year);
    open LOGFILE, "datafile.html";
    @log_data = <LOGFILE>;
    foreach $log_line(@log_data) {
        $log_line =~ s/.*(left square bracket)/ /;
        $log_line =~ s/"GET.*"h/ /;
        $log_line =~ s/".*/ /;
        print $log_line, "<p>";
    }

The last $log_line substitution does not work.

The datafile.html contains data in this form (square brackets are around the underlined date/times):

    24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
    24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
    24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
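
The conversion from a log timestamp to epoch seconds described above can be sketched roughly as follows. This is a minimal sketch, not code from the thread: the helper name log_time_to_epoch and the %month_num table are made up for illustration, and it assumes the timestamp always has the shape shown in the sample data (dd/Mon/yyyy:hh:mm:ss followed by a +hhmm or -hhmm zone offset).

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month abbreviations as they appear in the log, mapped to 0-based numbers.
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

# Hypothetical helper: turn "10/Dec/2002:18:05:09 -0500" into epoch seconds.
# Returns nothing (undef in scalar context) if the string does not parse.
sub log_time_to_epoch {
    my ($stamp) = @_;
    my ($day, $mon, $year, $hour, $min, $sec, $sign, $off_h, $off_m) =
        $stamp =~ m{^(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$}
        or return;
    return unless exists $month_num{$mon};

    # Treat the fields as UTC first, then remove the zone offset.
    my $epoch  = timegm($sec, $min, $hour, $day, $month_num{$mon}, $year);
    my $offset = ($off_h * 3600 + $off_m * 60) * ($sign eq '-' ? -1 : 1);
    return $epoch - $offset;
}

my $epoch = log_time_to_epoch('10/Dec/2002:18:05:09 -0500');
print "epoch seconds: $epoch\n";
```

Because the result is plain epoch seconds, it can be compared directly with time minus the requested window. Using timegm plus the logged offset, rather than timelocal, is a design choice made here so the result does not depend on the time zone of the machine running the script.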

Replies are listed 'Best First'.
Re: pulling by regex
by BrowserUk (Patriarch) on Dec 11, 2002 at 03:04 UTC

    This may help get you started. Incorporating this into a CGI.pm script is left as an exercise for the reader. (Hint: There's not much point in a use CGI; line if you're going to produce the HTML yourself.)

    Using Date::Manip makes the date calculation part easy (though its verbose but entirely opaque documentation has me gritting my teeth and banging my head every time). The regex I've used may not be robust, but there are plenty of other suggestions elsewhere in this thread to choose from.

        #! perl -slw
        use strict;
        use Date::Manip;
        use Data::Dumper;

        my $now  = ParseDate( scalar localtime() );
        my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ) );
        my $err;

        my $re = qr/
            ^.*?                # Skip the first part
            \[([^\]]+)\]\s+     # capture everything between []
            "[^"]+"\s+          # skip a quoted string and whitespace
            .*?                 # and a couple of numbers or blanks
            "( [^"]+ )"         # capture the next quoted string
        /x;

        my %referrers;

        while(<DATA>) {
            my @chunks = /$re/;
            my $ts = ParseDate $chunks[0];
            print "The line '@chunks' was logged ",
                Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")), " minutes ago.";
            if ( Date_Cmp( $ts, $then ) > 0 and Date_Cmp( $ts, $now ) < 0 ) {
                print "The previous line is within the window. Counting...";
                $referrers{$chunks[1]}++;
            }
            else {
                print "Discarding previous line";
            }
        }

        print "\nThese are the referrers counted:\n", Dumper(\%referrers);

        __DATA__
        24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
        24.208.200.247 - - [10/Dec/2002:18:08:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
        24.208.200.247 - - [10/Dec/2002:18:11:19 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"

    Produces

        C:\test>218961
        The line '10/Dec/2002:18:05:09 -0500 http://www.indystar.com/help/help/available.html' was logged 469.23 minutes ago.
        Discarding previous line
        The line '10/Dec/2002:18:08:13 -0500 http://www.indystar.com/help/help/available.html' was logged 466.17 minutes ago.
        The previous line is within the window. Counting...
        The line '10/Dec/2002:18:11:19 -0500 http://www.indystar.com/help/help/available.html' was logged 463.07 minutes ago.
        The previous line is within the window. Counting...
        These are the referrers counted:
        $VAR1 = {
                  'http://www.indystar.com/help/help/available.html' => '2'
                };

    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
    Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

      BrowserUk, I don't think I quite understand your code. I modified it to read my data, and it looks like I don't have it quite right:

          #!/usr/local/bin/perl -slw
          use strict;
          use Date::Manip;
          use Data::Dumper;

          my $now = ParseDate( scalar localtime() );
          print "now is $now<p>";
          my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ) );
          my $err;

          open LOGFILE, "datafile.html" || die "Can't open file";

          my $re = qr/
              ^.*?                # Skip the first part
              \[([^\]]+)\]\s+     # capture everything between []
              "[^"]+"\s+          # skip a quoted string and whitespace
              .*?                 # and a couple of numbers or blanks
              "( [^"]+ )"         # capture the next quoted string
          /x;

          my %referrers;

          while(<LOGFILE>) {
              my @chunks = /$re/;
              my $ts = ParseDate $chunks[0];
              print "The line '@chunks' was logged ",
                  Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")), " minutes ago.";
              if ( Date_Cmp( $ts, $then ) > 0 and Date_Cmp( $ts, $now ) < 0 ) {
                  print "The previous line is within the window. Counting...";
                  $referrers{$chunks[1]}++;
              }
              else {
                  print "Discarding previous line";
              }
          }

          print "\nThese are the referrers counted:\n", Dumper(\%referrers);

      datafile.html contains (in part):

      68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /scripts/s_code.js HTTP/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
      152.163.188.37 - - [15/Dec/2002:14:52:12 -0500] "GET /icons/unknown.gif HTTP/1.1" 200 245 "http://www.indystar.com/print/articles/?S=D" "Mozilla/4.0 (compatible; MSIE 5.5; AOL 7.0; Windows 98)"
      68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /images/white_159x60.gif HTTP/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /print/articles/2/008227-9652-031.html HTTP/1.0" 200 7275 "http://www.fark.com/" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/black_1x60.gif HTTP/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
      68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/69.gif HTTP/1.1" 200 1348 "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_01.gif HTTP/1.0" 200 2011 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_15.gif HTTP/1.0" 200 4162 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 200 3034 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/go_blue.gif HTTP/1.0" 200 133 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/aod_searchend2.gif HTTP/1.0" 200 186 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_08.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
      24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_10.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
      141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/email.gif HTTP/1.0" 200 138 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
      66.149.178.96 - - [15/Dec/2002:14:52:14 -0500] "GET /forums/showthread.php?s=&postid=177042 HTTP/1.1" 200 7302 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.0.1) Gecko/20021003"
      24.79.125.220 - - [15/Dec/2002:14:52:14 -0500] "GET /images/coheader2_aod_11.gif HTTP/1.1" 200 954 "http://www.indystar.com/forums/showthread.php?s=&postid=177044" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"

      Edit: Added <code> tags. Escaped [s and ]s. larsen

        Hi. Please read Site How To before you submit code next time and save the editors and yourself a lot of work. Thanks.

        I just appended the data lines from above to the end of the code I gave you at pulling by regex and it parsed it correctly.

        Output

        Then I looked at your version of the code and noticed this:

        open LOGFILE, "datafile.html" || die "Can't open file";

        The problem with this line is that you are not using brackets around the parameters to open. Combined with the relatively high precedence of ||, this means it is parsed as

        open( LOGFILE, ("datafile.html" || die "Can't open file") );

        Because the first operand of the || ("datafile.html") is always true, the second operand (die "Can't open file") is simply optimised away. That means that even if the open fails (because the input file does not exist, is not in the current directory, etc.), you will never see any error message. Could this be your problem?

        The fix is to use either

        open(LOGFILE, "datafile.html") || die "Can't open file: $!";

        or

        open LOGFILE, "datafile.html" or die "Can't open file: $!";

        Please also note the inclusion of $! in the error message. This will tell you why the open failed, not just that it failed. See Error Indicators for further details.
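
The difference between the two operators can be checked with a small experiment. This is only a sketch under stated assumptions: the file name in $missing is hypothetical and assumed not to exist, and eval is used purely to observe whether die was reached.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist in the current directory.
my $missing = "no_such_logfile_$$.txt";

# High-precedence ||: parsed as open($fh, '<', ($missing || die ...)).
# $missing is a true string, so die is never reached even though open fails.
my $ok_with_oror = eval {
    open my $fh, '<', $missing || die "Can't open $missing: $!\n";
    1;
};
my $reported_with_oror = $ok_with_oror ? 0 : 1;

# Low-precedence or: parsed as (open(...)) or die ..., so the failure is seen.
my $ok_with_or = eval {
    open my $fh, '<', $missing or die "Can't open $missing: $!\n";
    1;
};
my $reported_with_or = $ok_with_or ? 0 : 1;

print "|| reported the failure: $reported_with_oror\n";
print "or reported the failure: $reported_with_or\n";
```

With a missing file, only the version using or should report the failure; the || version carries on silently with an unopened handle.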

        The second thing I noted was the name of the file: "datafile.html"?? If this is a logfile, why is it named .html? If the file contains HTML tags, the regex supplied will not parse the data.

        You're not by any chance viewing and saving the logs via a web interface, are you? If so, you need to cut and paste from the screen to a file, or use "Save as...type *.txt" if your browser has that option, in order to remove the HTML tags from the file.

        If that doesn't explain and allow you to fix the problem come back and post the error message or otherwise describe what you are seeing (eg. No output, wrong output, etc).

        No need to re-post the code or data again unless it has changed substantially.

        Good luck.


        Examine what is said, not who speaks.

Re: pulling by regex
by Enlil (Parson) on Dec 11, 2002 at 01:04 UTC
    I am not all that certain what you mean by the last $log_line not working. It seems that you are getting the information you want, except that the h is missing from your URLs, because you remove it on the line:
    $log_line =~ s/"GET.*"h/ /;

    One thing I would advise is that, instead of stripping everything around what you want, you take some time to look over perlre so you get a better grasp of regular expressions and can pull what you want out of the lines directly. For instance, you are overusing the dot star (.*). In most cases you would be better off putting a ? after it (.*?) so that it does not match all the way to the end of the line and then backtrack until it finds a match.
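
To make the greedy versus non-greedy point concrete, here is a minimal sketch. The $line value is a made-up fragment in the same shape as the log data, not a line from the actual file, and the negated character class is shown as a further alternative.

```perl
use strict;
use warnings;

# Made-up fragment in the same shape as the log lines in the question.
my $line = '"GET /a.gif HTTP/1.0" 200 663 "http://www.example.com/" "Mozilla/4.0"';

my ($greedy)  = $line =~ /"(.*)"/;     # runs to the LAST quote on the line
my ($lazy)    = $line =~ /"(.*?)"/;    # stops at the first closing quote
my ($negated) = $line =~ /"([^"]*)"/;  # same result, cannot cross a quote at all

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "negated: $negated\n";
```

The greedy capture swallows the status code, the referrer and most of the user agent, while the other two stop at the end of the request string.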

    Anyhow, I might do something like the following inside the for loop:

        foreach my $log_line(@log_data) {
            my ($date_string,$referrer) = ($log_line =~ /\[([^\]]+)\] "[^"]+"[^"]+"([^"]+)"/);
            print "$date_string,$referrer<P>\n";
        }

    Which as I mentioned gets what I want and nothing else. ( I am making some assumptions about the rest of your data, but based on what you have it should work).

    <rant>you should be using strict as well</rant>

    -enlil

Re: pulling by regex
by petral (Curate) on Dec 11, 2002 at 01:01 UTC
    Not sure what's wrong with the last $log_line; it works for me.   Another way to approach it is to extract the parts you do want:
        $log_line =~ /\[([^]]+)\] "[^"]+" [^"]+ "([^"]+)"/;
        print "$1 $2<p>";
      p
Re: pulling by regex
by Abigail-II (Bishop) on Dec 11, 2002 at 10:49 UTC
    So, you are reading your weblogs over and over, once for each request? That's not very efficient. Why not dump the log data into a database, and query the database? You could have the database do most of the work, including finding the top 10.

    Abigail

      Hey, guys, thanks!!! This is a wonderful resource, and I incorporated some suggestions into the revised script below. I still have some questions, though!

      BrowserUk, I decided against using Date::Manip even though I really like that module. That's because the module's documentation warns that it's slower than other time modules, and this script will be used most often when the web server is overloaded with requests; thus, speed is essential.

      Abigail-II, a database would be nice, but the server is producing regular logs, so that's what I have to use.

      In the following script, here are my questions:

      1) Using strict produces errors that I don't have a global module loaded; what module is that?

      2) The simulated $month switch statement doesn't work as expected; instead of values 0 through 11, it gives everything a value of 1. Getting it changed to a number makes timelocal accurate.

      3) At the end, I pack all the referrers into an array; what I need to do is count each referrer as a unique URL, so that www.you.com is counted x times and www.me.com is counted y times. I can then tell the top referrer in the time period stipulated by the web page (which just has hours and minutes to enter). That will let me create output like
      www.you.com 22
      www.me.com 19
      etc
      How can I count an unknown value and produce this output? And is an array the best way to do it?
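
One common way to count values that are not known in advance is a hash keyed by the value itself. The following is a minimal sketch, not code from the thread: the contents of @refers are made up, and the names %count and @top are chosen only for illustration.

```perl
use strict;
use warnings;

# Stand-in for the referrers collected from the log; these values are made up.
my @refers = qw(
    www.you.com www.me.com www.you.com www.other.com www.you.com www.me.com
);

# One hash slot per distinct referrer; the value is how often it was seen.
my %count;
$count{$_}++ for @refers;

# Sort by count (highest first), break ties alphabetically, keep at most 10.
my @top = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;
splice @top, 10 if @top > 10;

print "$_ $count{$_}\n" for @top;
```

The array is still useful for collecting matches, but the hash does the counting; incrementing the hash directly inside the log loop would avoid the intermediate array altogether.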

      Any and all ideas welcome, and thanks in advance. I really appreciate the help!

      Here's the script, followed by some raw log data:

          #!/usr/local/bin/perl
          #use strict;
          use CGI qw(:standard);
          use CGI::Carp qw(fatalsToBrowser carpout);
          use Time::Local;

          # Grab information returned by web page
          $hour = param ("hour");
          $minute = param ("minute");

          # Allow perl to write to browser window
          print "Content-type: text/html\n\n";

          # Current time in seconds
          $now = time;

          # Convert submitted time to seconds
          $compare_time = ($hour * 3600) + ($minute * 60);

          # Times extracted by logs must be >= to $target
          $target = $now - $compare_time;

          open LOGFILE, "datafile.html" || die "Can't open file";
          @log_data =<LOGFILE>;

          # Grab useful information from each line of the web log
          foreach $log_line(@log_data) {

              # Grab date/time and referer
              ($date_string, $referrer) = ($log_line =~ /\[([^\]]+)\] "[^"]+"[^"]+"([^"]+)"/);

              # Replace / and : with spaces
              $date_string =~ s!/! !g;
              $date_string =~ s!:! !g;

              # Dump junk at end of line
              $date_string =~ s! -[0-9]+!!;

              # Split date/time into useful information
              ($day, $month, $year, $hhour, $min, $sec) = split(' ', $date_string);

              # Convert month from text to number
              if ($month == 'Jan') {$month = 0}
              elsif ($month == 'Feb') {$month = 1}
              elsif ($month == 'Mar') {$month = 2}
              elsif ($month == 'Apr') {$month = 3}
              elsif ($month == 'May') {$month = 4}
              elsif ($month == 'Jun') {$month = 5}
              elsif ($month == 'Jul') {$month = 6}
              elsif ($month == 'Aug') {$month = 7}
              elsif ($month == 'Sep') {$month = 8}
              elsif ($month == 'Oct') {$month = 9}
              elsif ($month == 'Nov') {$month = 10}
              else {$month = 11}

              # Calculate time on the log line in seconds
              $log_time = timelocal($sec,$min,$hhour,$day,$month,$year);

              if ($log_time >= $target) {
                  push @refers, $referrer;
              }
          }
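
A note on the month conversion in the script above: == is Perl's numeric comparison, and month names such as 'Dec' and 'Jan' both evaluate to 0 as numbers, so the first test succeeds for every month; string comparison needs eq. The sketch below illustrates this and shows a lookup hash as one possible replacement for the if/elsif chain. The variable names are made up, and every variable is declared with my, which is what use strict asks for.

```perl
use strict;
use warnings;

my $month = 'Dec';

# Numeric ==: both strings become 0, so this is true for any month name.
my $numeric_match = do {
    no warnings 'numeric';    # silence "isn't numeric" for the demonstration
    ($month == 'Jan') ? 1 : 0;
};

# String eq: compares the text itself, so 'Dec' is not 'Jan'.
my $string_match = ($month eq 'Jan') ? 1 : 0;

# A lookup hash avoids the comparison chain entirely.
my %month_num;
@month_num{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (0 .. 11);

print "== says Dec matches Jan: $numeric_match\n";
print "eq says Dec matches Jan: $string_match\n";
print "Dec is month number $month_num{$month}\n";
```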

      Some data:

      216.45.43.42 - - [12/Dec/2002:18:39:15 -0500] "GET /news/opinions/varvel.gif HTTP/1.1" 302 313 "http://www.freerepublic.com/forum/a3a95ca3c24a0.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/header_aod2_15.gif HTTP/1.1" 200 4162 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /images/storysearch2.gif HTTP/1.1" 200 142 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:15 -0500] "GET /users/ads/misc/remax_searchad3.gif HTTP/1.1" 200 2335 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sports_03_aod.gif HTTP/1.1" 200 3195 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/email.gif HTTP/1.1" 200 138 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/print.gif HTTP/1.1" 200 139 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/sidelinksend2.gif HTTP/1.1" 200 1009 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/pics2/image-007735-7410.jpg HTTP/1.1" 200 18319 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:16 -0500] "GET /images/advertisement_250strip.gif HTTP/1.1" 200 238 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"
      12.222.75.65 - - [12/Dec/2002:18:39:17 -0500] "GET /users/ads/story/macselect/macselect_250_Oct.gif HTTP/1.1" 200 10436 "http://www.indystar.com/print/articles/1/007735-7671-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; MSOCD; Q312461; YComp 5.0.0.0; .NET CLR 1.0.3705)"

      update (broquaint): changed <pre> tags to <code> tags

      On reflection, that's a good idea, using MySQL. But wouldn't it waste time overwriting the same database each time the script is called, since there would be no point in keeping the old data? As I envision it, I would translate each log line to a date string plus the referrer and send both to MySQL as two fields. Then I would process the input from the web page and use that information to pull from the database. What do you think?
        I think dumping the information of the log into a database each time the script is run is pretty stupid, and defeats the benefits. What's in the database is in the database, and doesn't have to be inserted again. Just dump the new logs to a database on a regular basis.

        Abigail

Node Type: perlquestion [id://218961]
Approved by dws