in reply to pulling by regex

This may help get you started. Incorporating this into a CGI.pm script is left as an exercise for the reader. (Hint: There's not much point in using CGI.pm if you're going to produce the HTML yourself.)

Using Date::Manip makes the date calculation part easy (though the verbose but entirely opaque documentation has me gritting my teeth and banging my head every time). The regex I've used may not be robust, but there are plenty of other offers above to choose from.

#! perl -slw
use strict;
use Date::Manip;
use Data::Dumper;

my $now  = ParseDate( scalar localtime() );
my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ) );
my $err;

my $re = qr/
    ^.*?                # Skip the first part
    \[([^\]]+)\]\s+     # capture everything between []
    "[^"]+"\s+          # skip a quoted string and whitespace
    .*?                 # and a couple of numbers or blanks
    "( [^"]+ )"         # capture the next quoted string
/x;

my %referrers;

while(<DATA>) {
    my @chunks = /$re/;
    my $ts = ParseDate $chunks[0];
    print "The line '@chunks' was logged ",
        Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")), " minutes ago.";

    if ( Date_Cmp( $ts, $then ) > 0 and Date_Cmp( $ts, $now ) < 0 ) {
        print "The previous line is within the window. Counting...";
        $referrers{$chunks[1]}++;
    }
    else {
        print "Discarding previous line";
    }
}

print "\nThese are the referrers counted:\n", Dumper(\%referrers);

__DATA__
24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:08:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"
24.208.200.247 - - [10/Dec/2002:18:11:19 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"

Produces

C:\test>218961
The line '10/Dec/2002:18:05:09 -0500 http://www.indystar.com/help/help/available.html' was logged 469.23 minutes ago.
Discarding previous line
The line '10/Dec/2002:18:08:13 -0500 http://www.indystar.com/help/help/available.html' was logged 466.17 minutes ago.
The previous line is within the window. Counting...
The line '10/Dec/2002:18:11:19 -0500 http://www.indystar.com/help/help/available.html' was logged 463.07 minutes ago.
The previous line is within the window. Counting...

These are the referrers counted:
$VAR1 = {
          'http://www.indystar.com/help/help/available.html' => '2'
        };
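
For anyone without Date::Manip installed, here is a rough sketch of the same idea using only the core Time::Local module. It is not the code above: the stricter pattern, the log_epoch helper, the fixed "now" and the 8 minute window are all assumptions made for illustration, and the pattern assumes the combined log format seen in the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month abbreviation => zero-based month number, as timegm expects.
my %MONTH = do {
    my $i = 0;
    map { $_ => $i++ } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
};

# Anchored pattern for one line of the combined log format (an assumption).
my $line_re = qr/
    ^ \S+ \s+ \S+ \s+ \S+ \s+   # host, ident, user
    \[ ( [^\]]+ ) \] \s+        # capture the timestamp between []
    " [^"]* " \s+               # skip the quoted request
    \S+ \s+ \S+ \s+             # skip status and size (size may be a dash)
    " ( [^"]* ) "               # capture the referrer
/x;

# Hypothetical helper: '10/Dec/2002:18:05:09 -0500' => epoch seconds.
sub log_epoch {
    my ($stamp) = @_;
    my ( $d, $mon, $y, $h, $m, $s, $sign, $oh, $om ) =
        $stamp =~ m{^(\d+)/(\w{3})/(\d{4}):(\d+):(\d+):(\d+)\s+([+-])(\d\d)(\d\d)$}
        or return;
    return unless exists $MONTH{$mon};
    my $offset = ( $oh * 3600 + $om * 60 ) * ( $sign eq '-' ? -1 : 1 );
    # Treat the fields as UTC, then remove the zone offset.
    return timegm( $s, $m, $h, $d, $MONTH{$mon}, $y ) - $offset;
}

# Fixed 'now' so the example is repeatable; a real script would use time().
my $now    = log_epoch('10/Dec/2002:18:15:00 -0500');
my $window = 8 * 60;    # hypothetical window, in seconds

my @lines = (
    '24.208.200.247 - - [10/Dec/2002:18:05:09 -0500] "GET /images/header_aod2_08.gif HTTP/1.0" 200 663 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"',
    '24.208.200.247 - - [10/Dec/2002:18:08:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 304 - "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"',
    '24.208.200.247 - - [10/Dec/2002:18:11:19 -0500] "GET /images/storysearch2.gif HTTP/1.0" 200 142 "http://www.indystar.com/help/help/available.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; H010818)"',
);

my %referrers;
for my $line (@lines) {
    my ( $stamp, $referrer ) = $line =~ $line_re or next;
    my $ts = log_epoch($stamp);
    next unless defined $ts;
    $referrers{$referrer}++ if $ts > $now - $window && $ts <= $now;
}

print "$_ => $referrers{$_}\n" for sort keys %referrers;
# prints: http://www.indystar.com/help/help/available.html => 2
```

Working in epoch seconds keeps the window test to two numeric comparisons, at the cost of parsing the timestamp by hand.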

Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp under foot, but someone has to get them.
Get used to the wings fast cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

Re: Re: pulling by regex
by mkent (Acolyte) on Dec 15, 2002 at 20:16 UTC
    BrowserUK, I don't think I quite understand your code. I modified it to read my data, and it looks like I don't have it quite right:

    #!/usr/local/bin/perl -slw
    use strict;
    use Date::Manip;
    use Data::Dumper;

    my $now = ParseDate( scalar localtime());
    print "now is $now<p>";
    my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ));
    my $err;

    open LOGFILE, "datafile.html" || die "Can't open file";

    my $re = qr/
        ^.*?                # Skip the first part
        \[([^\]]+)\]\s+     # capture everything between []
        "[^"]+"\s+          # skip a quoted string and whitespace
        .*?                 # and a couple of numbers or blanks
        "( [^"]+ )"         # capture the next quoted string
    /x;

    my %referrers;

    while(<LOGFILE>) {
        my @chunks = /$re/;
        my $ts = ParseDate $chunks[0];
        print "The line '@chunks' was logged ",
            Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")), " minutes ago.";

        if ( Date_Cmp( $ts, $then ) > 0 and Date_Cmp( $ts, $now ) < 0 ) {
            print "The previous line is within the window. Counting...";
            $referrers{$chunks[1]}++;
        }
        else {
            print "Discarding previous line";
        }
    }

    print "\nThese are the referrers counted:\n", Dumper(\%referrers);

    datafile.html contains (in part):

    68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /scripts/s_code.js HTTP/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/ 4.0 (compatible; MSIE 5.5; Windows 98)"
    152.163.188.37 - - [15/Dec/2002:14:52:12 -0500] "GET /icons/unknown.gif HTTP/1.1 " 200 245 "http://www.indystar.com/print/articles/?S=D" "Mozilla/4.0 (compatible ; MSIE 5.5; AOL 7.0; Windows 98)"
    68.22.179.211 - - [15/Dec/2002:14:52:12 -0500] "GET /images/white_159x60.gif HTT P/1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mo zilla/4.0 (compatible; MSIE 5.5; Windows 98)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /print/articles/2/008227-9 652-031.html HTTP/1.0" 200 7275 "http://www.fark.com/" "Mozilla/4.79 [en] (Windo ws NT 5.0; U)"
    68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/black_1x60.gif HTTP/ 1.1" 304 - "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozi lla/4.0 (compatible; MSIE 5.5; Windows 98)"
    68.22.179.211 - - [15/Dec/2002:14:52:13 -0500] "GET /images/69.gif HTTP/1.1" 200 1348 "http://www.indystar.com/print/articles/6/008596-6466-040.html" "Mozilla/4 .0 (compatible; MSIE 5.5; Windows 98)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_01.gif HTTP/1.0" 200 2011 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_15.gif HTTP/1.0" 200 4162 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 200 3034 "http://www.indystar.com/print/articles/2/008227-9652-031.ht ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/go_blue.gif HTTP/1 .0" 200 133 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Moz illa/4.79 [en] (Windows NT 5.0; U)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/aod_searchend2.gif HTTP/1.0" 200 186 "http://www.indystar.com/print/articles/2/008227-9652-031.htm l" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
    24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_08.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044 " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
    24.79.125.220 - - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_10.gif HTTP/1.1" 304 - "http://www.indystar.com/forums/showthread.php?s=&postid=177044 " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
    141.154.123.193 - - [15/Dec/2002:14:52:13 -0500] "GET /images/email.gif HTTP/1.0 " 200 138 "http://www.indystar.com/print/articles/2/008227-9652-031.html" "Mozil la/4.79 [en] (Windows NT 5.0; U)"
    66.149.178.96 - - [15/Dec/2002:14:52:14 -0500] "GET /forums/showthread.php?s=&po stid=177042 HTTP/1.1" 200 7302 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1 .0.1) Gecko/20021003"
    24.79.125.220 - - [15/Dec/2002:14:52:14 -0500] "GET /images/coheader2_aod_11.gif HTTP/1.1" 200 954 "http://www.indystar.com/forums/showthread.php?s=&postid=1770 44" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"

    Edit: Added <code> tags. Escaped [s and ]s. larsen

      Hi. Please read Site How To before you submit code next time and save the editors and yourself a lot of work. Thanks.

      I just appended the data lines from above to the end of the code I gave you at pulling by regex and it parsed it correctly.


      Then I looked at your version of the code and noticed this:

      open LOGFILE, "datafile.html" || die "Can't open file";

      The problem with this line is that, because you are not using parentheses around the parameters to open, and because || has relatively high precedence, it is being parsed as

      open( LOGFILE, ("datafile.html" || die "Can't open file") );

      Because the first operand of the || (the non-empty string "datafile.html") is always true, the second part (die "Can't open file") is never evaluated and is simply optimised away. That means that even if the open fails (because the input file does not exist, is not in the current directory, etc.), you will never see any error message. Could this be your problem?

      The fix is to use either

      open(LOGFILE, "datafile.html") || die "Can't open file: $!";

      or

      open LOGFILE, "datafile.html" or die "Can't open file: $!";

      Please also note the inclusion of $! in the error message. This will tell you why the open failed, not just that it failed. See Error Indicators for further details.
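
To make the precedence point concrete, here is a small sketch that runs both forms against a file that is assumed not to exist. The filename is hypothetical, and the sketch uses a lexical filehandle with three-argument open rather than the bareword LOGFILE from the thread; the precedence behaviour is the same either way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filename, assumed not to exist in the current directory.
my $missing = 'no_such_file_for_this_example.log';

# Buggy form, written with the parentheses Perl effectively supplies:
# || applies to the filename, which is always true, so die never runs.
my $ok = open( my $fh1, '<', ( $missing || die "never reached" ) );
print "buggy form: open returned ", ( $ok ? "true" : "false" ),
    " and nothing died\n";

# Fixed form: low-precedence 'or' applies to the result of open itself.
my $error = '';
eval {
    open my $fh2, '<', $missing or die "Can't open '$missing': $!\n";
    1;
} or $error = $@;
print "fixed form: $error";
```

The eval is only there so that the example can show the message and carry on; in a real script the die would normally be left to end the program.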

      The second thing I noted was the name of the file: "datafile.html"?? If this is a logfile, why is it named .html? If the file contains HTML tags, the regex supplied will not parse the data.

      You're not by any chance viewing and saving the logs via a web interface, are you? If so, you need to cut&paste from the screen to a file, or use "Save as...type *.txt" if your browser has that option, in order to remove the HTML tags from the file.

      If that doesn't explain the problem and allow you to fix it, come back and post the error message or otherwise describe what you are seeing (e.g. no output, wrong output, etc.).

      No need to re-post the code or data again unless it has changed substantially.

      Good luck.


      Examine what is said, not who speaks.

        Thanks. It turned out that the "html" tag on a file that contained no html caused the data to be split into multiple lines instead of one solid line. I'm not sure how that happened, but saving the data as .txt fixed the problem.

        I've added a routine to allow it to select one of 3 separate logs and that works well.

        Now I need some more advice, if I may:

        1. How do I suppress the VAR1 and get just the output lines?

        2. How do I add a paragraph tag (p) after each output line?

        3. How would I sort them by number of referrers found?

        4. How would I display 10 results at a time, with the ability to go to the next 10 and back to the previous 10?

        Thanks.
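
One possible way to approach questions 1 to 4, sketched rather than tested against the real logs: print the hash yourself instead of passing it to Dumper (which is where the $VAR1 comes from), sort the keys by their counts, append the tag to each line, and take one slice of the sorted list per page. The counts below are made up for illustration, and page_of_results is a hypothetical helper, not something from the code earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up counts standing in for the %referrers hash built from the log.
my %referrers = (
    'http://www.indystar.com/help/help/available.html' => 2,
    'http://www.fark.com/'                             => 5,
    '-'                                                => 1,
);

# Sort by count, highest first; break ties by name so the order is stable.
my @sorted =
    sort { $referrers{$b} <=> $referrers{$a} or $a cmp $b } keys %referrers;

# Hypothetical helper: the output lines for one page (pages count from 0).
sub page_of_results {
    my ( $page, $per_page ) = @_;
    my $first = $page * $per_page;
    return () if $first > $#sorted;
    my $last = $first + $per_page - 1;
    $last = $#sorted if $last > $#sorted;
    return map { "$referrers{$_} $_<p>" } @sorted[ $first .. $last ];
}

print "$_\n" for page_of_results( 0, 10 );
# prints:
# 5 http://www.fark.com/<p>
# 2 http://www.indystar.com/help/help/available.html<p>
# 1 -<p>
```

For next and previous links, the page number would have to travel with each request (for example as a query parameter), and referrer strings should be HTML-escaped before being printed into a page; both are left out of this sketch.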