PerlMonks  

count sort & output II

by mkent (Acolyte)
on Dec 19, 2002 at 16:42 UTC

mkent has asked for the wisdom of the Perl Monks concerning the following question:

Thanks to PhiRate, Browseruk, Sauoq and others, I now have working code, but it needs some finishing touches:

1) Specifically, the last block, "Sort and print", prints everything instead of displaying the first 10 and then providing a way to display the next 10 and scroll back to the previous 10. Any ideas on how to do this?

2) The code is a bit slow. On a log with 18,646,915 bytes it takes a little over a minute and a half. But on a big log with 22,623,798 bytes it times out after 3 minutes. Any ideas on how to make it run faster?

Thanks, and here's the code:

    use strict;
    use warnings;
    use Date::Manip;
    use CGI qw/:standard/;

    # Make sure security is not compromised by calling unpathed programs.
    $ENV{PATH} = "/bin:/usr/bin:/usr/local/bin:";
    $ENV{IFS}="";

    # Use CGI to print the header
    print header;

    # Make variables local only
    my %referers = ();
    my $row = 0;
    my $counter = 0;

    # Retrieve and security-check parameters
    my $site = param('site');
    my $hour = param('hour');
    my $minute = param('minute');
    if ($hour !~ /^\d\d?$/) { die('Invalid hour'); }
    if ($minute !~ /^\d\d?$/) { die('Invalid minute'); }

    # Get date object for the checkpoint
    my $check_date = ParseDate("${hour}hours ${minute}minutes ago");

    # Select the server log - current 12/19/02
    my $data = '';
    if ($site eq 'star') {$data = 'indystar/access_log'}
    elsif ($site eq 'topics') {$data = 'topics/access_log'}
    else {$data = 'noblesville/access_log'}

    # Create headline for web page
    print "<h1>Referrers in the past $hour hours and $minute minutes</h1>";

    # File handling, one line at a time; if can't open, say why
    open(FH,"$data") || die('Could not open $data: $!');
    while (my $line = <FH>) {
        next if ($line !~ /^\S+ \S \S \[(\S+) \S+\] "[^"]+" \d+ \d+ "([^"]+)"/);
        my $line_date = ParseDate($1);

        # Check to see if the line date is in the range we're after
        next unless Date_Cmp($line_date, $check_date)>0;

        # If the referer is new, set to 1 entry, otherwise increment
        if (!$referers{$2}) { $referers{$2}=1; }
        else { $referers{$2}++; }
    }
    close(FH);

    # Sort and print
    for (sort {$referers{$b} <=> $referers{$a}} keys %referers) {
        print "$_ - $referers{$_}<p>";
        unless (++$counter % 10) {
            print "Press Enter";
            <STDIN>
        }
    }

Here's some sample log data it's reading:

66.149.65.62 - - 19/Dec/2002:09:02:59 -0500 "GET /images/email.gif HTTP/1.1" 304 - "http://www.indystar.com/print/articles/5/009542-7185-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; H010818)"
66.149.65.62 - - 19/Dec/2002:09:02:59 -0500 "GET /images/print.gif HTTP/1.1" 304 - "http://www.indystar.com/print/articles/5/009542-7185-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; H010818)"
66.149.65.62 - - 19/Dec/2002:09:02:59 -0500 "GET /images/sidelinksend2.gif HTTP/1.1" 304 - "http://www.indystar.com/print/articles/5/009542-7185-036.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; H010818)"
66.72.209.208 - - 19/Dec/2002:09:02:59 -0500 "GET /images/pics2/image-005305-3314.jpg HTTP/1.1" 304 - "http://www.indystar.com/print/articles/8/005305-9938-038.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; .NET CLR 1.0.3705; MSIECrawler)"
66.134.224.29 - - 19/Dec/2002:09:02:59 -0500 "GET /images/header_aod2_01.gif HTTP/1.1" 200 2011 "http://www.indystar.com/print/articles/6/009478-6696-040.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
66.72.209.208 - - 19/Dec/2002:09:02:59 -0500 "GET /print/articles/0/005306-8900-038.html HTTP/1.1" 200 8361 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; .NET CLR 1.0.3705; MSIECrawler)"

Replies are listed 'Best First'.
Re: count sort & output II
by blokhead (Monsignor) on Dec 19, 2002 at 17:18 UTC
    You're running this as a CGI script, are you not? If so, then this line:
    unless (++$counter % 10) { print "Press Enter"; <STDIN> }
    ..won't work. Whoever suggested using this must have assumed you were running this script from the command line. If you want to get back only 10 results at a time from your CGI script, you need to either:
    • Reparse the entire log file each time you get a request for 10 results, or
    • Store the processed results somewhere, maybe a database
    Your CGI script can't store your hash in between HTTP requests, since it is unloaded from memory after every request [1]. I'll assume you don't want to wait 2 minutes for each 10 records, so I'd recommend writing to a cache file somewhere, storing all of the referrer => num records in your preferred sorted order. On each request to your script, just get the appropriate 10 lines from the file if it exists, or else regenerate it if it doesn't. There should also be a mechanism to force a regeneration if the contents of the cache get stale.
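A minimal sketch of the cache-file idea described above, not a finished implementation. The sub names, the tab-separated "count, referer" line format, and the age-based staleness test are all assumptions made for illustration; only core Perl is used.

```perl
use strict;
use warnings;

# True if the cache file is missing or older than $max_age seconds.
# (Age-based staleness is an assumption; comparing against the log's
# modification time would be another option.)
sub cache_is_stale {
    my ($file, $max_age) = @_;
    return 1 unless -e $file;
    my $mtime = (stat $file)[9];
    return (time - $mtime) > $max_age;
}

# Write one "count<TAB>referer" line per referer, highest count first.
sub write_cache {
    my ($file, $referers) = @_;
    open my $out, '>', $file or die "Could not write $file: $!";
    for my $ref (sort { $referers->{$b} <=> $referers->{$a} || $a cmp $b }
                 keys %$referers) {
        print {$out} "$referers->{$ref}\t$ref\n";
    }
    close $out or die "Could not close $file: $!";
}

# Return the lines belonging to one page (page numbers start at 1).
sub read_page {
    my ($file, $page, $per_page) = @_;
    my $first = ($page - 1) * $per_page + 1;
    my $last  = $first + $per_page - 1;
    my @lines;
    open my $in, '<', $file or die "Could not read $file: $!";
    while (my $line = <$in>) {
        next if $. < $first;    # $. is the current line number
        last if $. > $last;
        chomp $line;
        push @lines, $line;
    }
    close $in;
    return @lines;
}
```

With something like this, a request would call cache_is_stale() first, reparse the log and call write_cache() only when needed, and then serve read_page($file, $page, 10).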

    You may want to do some debugging to see where the major slowdown is. If it's the while loop, there's probably not a lot you can do, but if it's in the sorting and copying of the hash keys (BTW, how big does this hash end up?), you may want to consider a non-hash-based solution. Your for-loop has to make a copy of all the hash keys in memory, which may take a long time, considering your HTTP-referer strings are all probably fairly long. Other monks might have some good ideas about improving this portion of the code, but I'm at a loss at the moment.
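One low-effort way to see which phase is slow is to time the parsing loop and the sort separately. The sketch below uses the core Time::HiRes module; the timed() helper and the parse_log() routine named in its comments are hypothetical, introduced only for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a code reference and return (elapsed seconds, its return values).
sub timed {
    my ($code) = @_;
    my $start  = time;
    my @result = $code->();
    return (time - $start, @result);
}

# Hypothetical usage inside the script:
#   my ($t_parse, $referers) = timed(sub { parse_log($data) });
#   my ($t_sort, @sorted) = timed(sub {
#       sort { $referers->{$b} <=> $referers->{$a} } keys %$referers
#   });
#   warn sprintf "parse %.2fs, sort %.2fs\n", $t_parse, $t_sort;
```

Under CGI, the warn output normally lands in the web server's error log rather than in the page.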

    Good luck,

    blokhead

    1: Of course, this is not true if your script is running under mod_perl, but it doesn't look like it is.

      Thanks. You're right, my Apache server isn't using mod_perl and actually I didn't know that might make a difference, so thanks for the explanation.

      For a cache, this is what I came up with. I'm sure there's a better way, so any suggestions welcome. It doesn't speed up the processing at all.

      # Sort and print
      for (sort {$referers{$b} <=> $referers{$a}} keys %referers) {
          if ($counter <= 10) {
              open (FILE, ">storage1.txt") || die('Could not open $storage1.txt: $1');
              print "$_ - $referers{$_}<p>";
              print FILE "$_ - $referers{$_}<p>";
              ++$counter;
          }
          elsif ($counter > 10 && $counter <= 20) {
              if ($counter == 11) {
                  print "<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"storage2.txt\"><font color=\"FF0000\">Next</font></a><br>";
                  print FILE "<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"storage2.txt\"><font color=\"FF0000\">Next</font></a><br>";
                  open (FILE2, ">storage2.txt") || die('Could not open $storage2.txt: $1');
                  print FILE2 "$_ - $referers{$_}<p>";
                  ++$counter;
              }
          }
          elsif ($counter > 20 && $counter <= 30) {
              if ($counter == 21) {
                  print "<p>a href=\"storage2.txt\"><font color=\"FF0000\">Previous</font></a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=\"storage3.txt\"><font color=\"FF0000\">Next</font></a><br>";
                  open (FILE3, ">storage3.txt") || die('Could not open $storage3.txt: $1');
                  print FILE3 "$_ - $referers{$_}<p>";
                  ++$counter;
              }
          }
          elsif ($counter > 30 && $counter <= 40) {
              if ($counter == 31) {
                  print "<p>a href=\"storage3.txt\"><font color=\"FF0000\">Previous</font></a><br>";
                  open (FILE4, ">storage4.txt") || die('Could not open $storage4.txt: $1');
                  print FILE4 "$_ - $referers{$_}<p>";
              }
          }
      }
      close FILE;
      close FILE2;
      close FILE3;
      close FILE4;

Re: count sort & output II
by BrowserUk (Patriarch) on Dec 19, 2002 at 18:31 UTC

    Taking a quick scan of the replies to your four top-level nodes (pulling by regex, Pulling by regex II, count sort & output, and count sort & output II), and given that you pasted these lines from sauoq's post at Re: count sort & output

    # Sort and print
    for (sort {$referers{$b} <=> $referers{$a}} keys %referers) {
        print "$_ - $referers{$_}<p>";
        unless (++$counter % 10) {
            print "Press Enter";
            <STDIN>
        }
    }

    without any attempt to adjust the keyboard input used, it is fairly obvious that you are simply cutting and pasting code you don't understand, and show no inclination to try to understand.

    It seems to me that you are much worse than the school kid who comes here thinking it's a neat idea to get us to do his homework for him. You seem to think that it's a neat idea to get us to write, for you, software that you obviously intend to use for commercial purposes.

    You show no indication of making any effort whatsoever to learn Perl. I typed as much in response to your first attempt to get us to write this, but I can't find that reply now, so maybe I didn't submit it.

    Either get the books, read the books and learn, or employ a programmer. There are plenty out there looking for work.

    You'll get no further help from me, and I would strongly advise others to take the same attitude.


    Examine what is said, not who speaks.

      I'm sorry you feel that way. For the record, I'm not trying to get you to write software for me, but this is my very first program (admittedly ambitious) and I'm learning far more from examples like those posted here than I got from the book. My skills so far are very limited and I really appreciate the help from monks more experienced than me.

      Also for the record, in my inexperience I had no idea that the code above was for the command line; my CGI takes input from a web page.

      You may also recall that when this thread started, I posted code I had written from my limited understanding and you told me I should instead use a module I had never heard of before, and that's what I'm trying to do.

      Thanks.

Re: count sort & output II
by fruiture (Curate) on Dec 19, 2002 at 17:35 UTC

    There are indeed some caveats about your code. For example, you'll get some "uninitialized" warnings, because $hour and $minute may be undef. param() returns undef for parameters that weren't submitted at all.

    Next, and this is a matter of style rather than efficiency: use CGI to generate your HTML (and do generate real HTML, which you don't yet). After all, you're loading the module, so use it.
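A small sketch of one way to handle the undef case before the regex check runs. The checked_number() helper and its "fall back to a default when the parameter is missing" policy are assumptions for illustration, not something taken from the original script.

```perl
use strict;
use warnings;

# Validate a one- or two-digit CGI parameter.
# - undef (parameter not submitted) falls back to $default
# - anything that is not one or two digits is rejected
sub checked_number {
    my ($name, $value, $default) = @_;
    return $default unless defined $value;
    die "Invalid $name\n" unless $value =~ /\A\d\d?\z/;
    return $value;
}

# Hypothetical usage with CGI.pm:
#   my $hour   = checked_number('hour',   scalar param('hour'),   0);
#   my $minute = checked_number('minute', scalar param('minute'), 0);
```

The \A and \z anchors are used instead of ^ and $ so that a value with a trailing newline is rejected as well.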

    "$data" is equivalent to $data, for $data is a plain string.

    In terms of speed: use defined() and exists():

    # why evaluate true/false?
    #while (my $line = <FH>) {
    # defined() is enough:
    while(defined( my $line = <FH> ) ){

    #---

    # why retrieve value and evaluate true/false?
    #if (!$referers{$2}) {
    # when exists() can do this way faster
    if( not exists $referers{$2} ){

    Next, you cannot do via CGI the things you're used to doing from the command line. In a CGI context, STDIN contains POST-submitted form data; it is not terminal input.
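A related idiom, offered as a sketch rather than as what the reply above proposes: Perl's ++ treats a missing hash entry as 0 without raising a warning, so the counting branch can collapse to a single statement. Perl also already applies defined() implicitly to a loop written as while (my $line = <FH>), so the explicit form is mainly a readability choice. The count_referers() name is hypothetical.

```perl
use strict;
use warnings;

# Count how often each referer string occurs.
# $count{$key}++ creates the entry on first sight and increments it afterwards.
sub count_referers {
    my (@referers) = @_;
    my %count;
    $count{$_}++ for @referers;
    return \%count;
}
```

In the original loop this would correspond to replacing the if/else pair with $referers{$2}++.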

    You'll need to have another request in order to have a next page. For CGI this means all the work has to begin again.
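A sketch of how the "another request per page" idea could look: each page carries a page number in its query string, and the script emits Previous/Next links pointing back at itself. The page_links() helper, the page parameter name, and the script name in the usage comment are all assumptions; a real version would also need to carry site, hour and minute along in the links.

```perl
use strict;
use warnings;

# Build Previous/Next links for a paged listing.
# $script   - URL of the CGI script (assumed to be already HTML-safe)
# $page     - current page number, starting at 1
# $per_page - rows shown per page
# $total    - total number of rows available
sub page_links {
    my ($script, $page, $per_page, $total) = @_;
    my $last_page = int(($total + $per_page - 1) / $per_page) || 1;
    my @links;
    push @links, sprintf('<a href="%s?page=%d">Previous</a>', $script, $page - 1)
        if $page > 1;
    push @links, sprintf('<a href="%s?page=%d">Next</a>', $script, $page + 1)
        if $page < $last_page;
    return join ' ', @links;
}

# Hypothetical usage:
#   print page_links('referers.cgi', $page, 10, scalar keys %referers);
```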

    You see: the whole thing becomes more difficult. To create acceptably fast output, store the data differently: save the counts sorted by referer and update that database (it needn't be a real database, though one would be most efficient) whenever the actual logfile is updated.

    --
    http://fruiture.de
Re: count sort & output II
by sauoq (Abbot) on Dec 19, 2002 at 23:00 UTC

    I ignored the CGI aspect of your original question. CGI complicates things significantly and it is hard to know the best approach to suggest taking without asking several questions.

    • Do you expect many people to access this CGI program at the same time?
    • How responsive must it be to new information?
    • What limits do you have on the size of the dataset you will be working with?
    • How else will you need to access the data?

    Etc. etc. etc. You might consider saving the output to a temporary file as suggested by an anonymonk here. That's good advice but it could work against you if you need to quickly have access to the newest entries.

    I think that, in order for us to really be helpful, you'll need to better explain your requirements.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: count sort & output II
by waswas-fng (Curate) on Dec 19, 2002 at 18:03 UTC
    Also, you may want to have two logic forks: if the size of the file is greater than the physical memory size, use File::Sort; otherwise use an in-memory sort. As you look at larger and larger log files, you will become memory-bound doing it this way.

    -Waswas

Node Type: perlquestion [id://221155]
Approved by Mr. Muskrat