Re: Re: pulling by regex

by mkent (Acolyte)
on Dec 15, 2002 at 20:16 UTC ( [id://220042]=note )

in reply to Re: pulling by regex
in thread pulling by regex

BrowserUk, I don't think I quite understand your code. I modified it to read my data, and it looks like I don't have it quite right:

#!/usr/local/bin/perl -slw
use strict;
use Date::Manip;
use Data::Dumper;

my $now = ParseDate( scalar localtime());
print "now is $now<p>";
my $then = DateCalc( $now, ParseDateDelta( "7 hours 48 minutes ago" ));
my $err;

open LOGFILE, "datafile.html" || die "Can't open file";

my $re = qr/
    ^.*?                # Skip the first part
    \[([^\]]+)\]\s+     # capture everything between []
    "[^"]+"\s+          # skip a quoted string and whitespace
    .*?                 # and a couple of numbers or blanks
    "( [^"]+ )"         # capture the next quoted string
/x;

my %referrers;

while(<LOGFILE>) {
    my @chunks = /$re/;
    my $ts = ParseDate $chunks[0];
    print "The line '@chunks' was logged ",
        Delta_Format( DateCalc( $ts, $now, \$err ), 2, ("%mt")),
        " minutes ago.";
    if ( Date_Cmp( $ts, $then ) > 0 and Date_Cmp( $ts, $now ) < 0 ) {
        print "The previous line is within the window. Counting...";
        $referrers{$chunks[1]}++;
    }
    else {
        print "Discarding previous line";
    }
}

print "\nThese are the referrers counted:\n", Dumper(\%referrers);
datafile.html contains (in part):

- - [15/Dec/2002:14:52:12 -0500] "GET /scripts/s_code.js HTTP/1.1" 304 - "" "Mozilla/ 4.0 (compatible; MSIE 5.5; Windows 98)"
- - [15/Dec/2002:14:52:12 -0500] "GET /icons/unknown.gif HTTP/1.1 " 200 245 "" "Mozilla/4.0 (compatible ; MSIE 5.5; AOL 7.0; Windows 98)"
- - [15/Dec/2002:14:52:12 -0500] "GET /images/white_159x60.gif HTT P/1.1" 304 - "" "Mo zilla/4.0 (compatible; MSIE 5.5; Windows 98)"
- - [15/Dec/2002:14:52:13 -0500] "GET /print/articles/2/008227-9 652-031.html HTTP/1.0" 200 7275 "" "Mozilla/4.79 [en] (Windo ws NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/black_1x60.gif HTTP/ 1.1" 304 - "" "Mozi lla/4.0 (compatible; MSIE 5.5; Windows 98)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/69.gif HTTP/1.1" 200 1348 "" "Mozilla/4 .0 (compatible; MSIE 5.5; Windows 98)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_01.gif HTTP/1.0" 200 2011 " ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_15.gif HTTP/1.0" 200 4162 " ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_10.gif HTTP/1.0" 200 3034 " ml" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/go_blue.gif HTTP/1 .0" 200 133 "" "Moz illa/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/aod_searchend2.gif HTTP/1.0" 200 186 " l" "Mozilla/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_08.gif HTTP/1.1" 304 - " " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/coheader2_aod_10.gif HTTP/1.1" 304 - " " "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"
- - [15/Dec/2002:14:52:13 -0500] "GET /images/email.gif HTTP/1.0 " 200 138 "" "Mozil la/4.79 [en] (Windows NT 5.0; U)"
- - [15/Dec/2002:14:52:14 -0500] "GET /forums/showthread.php?s=&po stid=177042 HTTP/1.1" 200 7302 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1 .0.1) Gecko/20021003"
- - [15/Dec/2002:14:52:14 -0500] "GET /images/coheader2_aod_11.gif HTTP/1.1" 200 954 " 44" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Q312461)"

Edit: Added <code> tags. Escaped [s and ]s. larsen

Replies are listed 'Best First'.
Re: Re: Re: pulling by regex
by BrowserUk (Patriarch) on Dec 15, 2002 at 21:56 UTC

    Hi. Please read Site How To before you submit code next time and save the editors and yourself a lot of work. Thanks.

    I just appended the data lines from above to the end of the code I gave you at pulling by regex and it parsed it correctly.


    Then I looked at your version of the code and noticed this:

    open LOGFILE, "datafile.html" || die "Can't open file";

    The problem with this line is that you are not using brackets around the parameters to open, and || has relatively high precedence, so it is being parsed as

    open( LOGFILE, ("datafile.html" || die "Can't open file") );

    Because the first operand of the || (the filename string) is always true, the second operand (die "Can't open file") is simply optimised away. This means that even if the open fails (because the input file does not exist, is not in the current directory, etc.), you will never see any error message. Could this be your problem?

    The fix is to use either

    open(LOGFILE, "datafile.html") || die "Can't open file: $!";

    or

    open LOGFILE, "datafile.html" or die "Can't open file: $!";

    Please also note the inclusion of $! in the error message. This will tell you why the open failed, not just that it failed. See Error Indicators for further details.
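
    To see the difference for yourself, here is a minimal sketch. It makes a few assumptions that are not in your script: the path is hypothetical and is assumed not to exist, the three-argument form of open with a lexical filehandle stands in for the bareword LOGFILE, each variant is wrapped in eval so the script keeps running, and the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist, so every open below fails.
my $file = "no-such-dir/no-such-file.log";

# Each sub returns "silent" if the failed open went unnoticed,
# or "died" if die was reached.

sub unbracketed_pipes {
    # Parsed as open( $fh, '<', ( $file || die ... ) ). $file is a true
    # value, so die is never evaluated.
    my $ok = eval { open my $fh, '<', $file || die "Can't open file: $!"; 1 };
    return $ok ? "silent" : "died";
}

sub bracketed_pipes {
    # The brackets end the argument list, so || tests the result of open.
    my $ok = eval { open( my $fh, '<', $file ) || die "Can't open file: $!"; 1 };
    return $ok ? "silent" : "died";
}

sub low_precedence_or {
    # 'or' binds more loosely than the argument list, so it also tests open.
    my $ok = eval { open my $fh, '<', $file or die "Can't open file: $!"; 1 };
    return $ok ? "silent" : "died";
}

print "no brackets with ||: ", unbracketed_pipes(), "\n";
print "brackets with ||:    ", bracketed_pipes(),   "\n";
print "no brackets with or: ", low_precedence_or(), "\n";
```

    Only the first variant should report "silent"; the other two reach die, and $! carries the reason the open failed.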

    The second thing I noted was the name of the file: "datafile.html"?? If this is a logfile, why is it named .html? If the file contains html tags, the regex supplied will not parse the data.

    You're not by any chance viewing and saving the logs via a web interface, are you? If so, you need to cut&paste from the screen to a file, or use "Save as...type *.txt" if your browser has that option, in order to remove the html tags from the file.

    If that doesn't explain the problem and allow you to fix it, come back and post the error message or otherwise describe what you are seeing (e.g. no output, wrong output, etc.).

    No need to re-post the code or data again unless it has changed substantially.

    Good luck.

    Examine what is said, not who speaks.

      Thanks. It turned out that the "html" tag, where there was no html, caused the data to be split into multiple lines instead of one solid line. I'm not sure how that happened, but saving the data as .txt fixed the problem.
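
      A minimal sketch of that failure, for anyone who hits the same thing. The regex is the one from the script above; the referrer URL is hypothetical (the real ones are stripped from the posted data), the point where the entry is broken is an assumption, and parse_line is a helper made up for this example.

```perl
use strict;
use warnings;

# The regex from the script in this thread, unchanged.
my $re = qr/
    ^.*?                # Skip the first part
    \[([^\]]+)\]\s+     # capture everything between []
    "[^"]+"\s+          # skip a quoted string and whitespace
    .*?                 # and a couple of numbers or blanks
    "( [^"]+ )"         # capture the next quoted string
/x;

# Hypothetical helper: returns (timestamp, referrer) for a matching line,
# or an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    return $line =~ $re;
}

# One intact entry shaped like the posted data, with a made-up referrer.
my $whole = '- - [15/Dec/2002:14:52:13 -0500] '
          . '"GET /images/header_aod2_01.gif HTTP/1.0" 200 2011 '
          . '"http://example.com/page.html" '
          . '"Mozilla/4.79 [en] (Windows NT 5.0; U)"';

# The same entry broken in the middle of the request string.
my @pieces = (
    '- - [15/Dec/2002:14:52:13 -0500] "GET /images/header_aod2_01.gif',
    'HTTP/1.0" 200 2011 "http://example.com/page.html" '
        . '"Mozilla/4.79 [en] (Windows NT 5.0; U)"',
);

for my $line ( $whole, @pieces ) {
    my @chunks = parse_line($line);
    print @chunks ? "matched: $chunks[0] | $chunks[1]\n" : "no match\n";
}
```

      Only the intact entry gives back a timestamp and a referrer. Neither half of the broken entry has both the bracketed timestamp and the closed quoted strings that the regex needs, so @chunks comes back empty for each.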

      I've added a routine to allow it to select one of 3 separate logs and that works well.

      Now I need some more advice, if I may:

      1. How do I suppress the VAR1 and get just the output lines?

      2. How do I add a paragraph tag (p) after each output line?

      3. How would I sort them by number of referrers found?

      4. How would I display 10 results at a time, with the ability to go to the next 10 and back to the previous 10?


        Read and learn or employ a programmer.

        Examine what is said, not who speaks.
