davistv has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I've got some huge apache log files which I want to search backwards for the last access to a particular url. I can do this from the command line very effectively with this:

tac reall_big_access_log | grep -m 1 "/myapp/"

It's really fast, which is what I like about it. The -m 1 part of the grep command kills tac upon finding the first match, so file reading is minimized.

I've tried several ways of writing a script to interate over these access logs and execute the same statement, but I've had no luck. Each method I've tried that involved system() resulted in the tac command not being killed by the grep -m 1 statement, so the tac commands crawl through the entirety of every log file, which takes a very long time.

Below is my current script, which relies on File::ReadBackwards to emulate tac. This is quite a bit slower than the command line method as well. Could anyone suggest a good way to just execute the command line above inside a script?

Thank You,
Troy

#!/usr/bin/perl use strict; use Date::Parse; use File::ReadBackwards; my $log_dir = '/var/log/httpd/'; my $session_limit = 3600; # 1 hour my ($sec,$min,$hour,$mday,$month,$year,$wday,$yday,$isdst) = localtime +(time); $year += 1900; $month++; my $date_cmp_str = join('', ($year, sprintf('%02d', $month), sprintf(' +%02d', $mday), $hour, $min, $sec)); print 'myapp user status' . "\n"; my @expired; my @active; opendir(DIR, $log_dir); my @files = grep { /_access_log$/ } readdir(DIR); closedir DIR; print STDERR "Files = " . join(", ", @files) . "\n\n"; sleep(3); foreach my $lf (@files) { my ($domain, $t1, $t2) = split("_", $lf); my $myapp_dir = '/w3/' . $domain . '/myapp/'; if (!-e $myapp_dir) { next; } else { # Find last login my $file = $log_dir . $lf; my $last_line = ''; my $bw = File::ReadBackwards->new($file) or die "can't + read '$file' $!"; my $found = 0; while( !$found && defined( my $log_line = $bw->readlin +e ) ) { if ($log_line =~ /GET \/myapp\//) { $last_line = $log_line; $bw->close(); } } if ($last_line eq '') { print "No access in logs.\n"; } else { my ($ip, $d1, $d2, $date, $tz, $method, $url, +$rest) = split(" ", $last_line); $date = substr($date, 1); $tz = substr($tz, 0, -1); my $cdate = $date . ' ' . $tz; my ($ss,$mm,$hh,$dy,$mon,$yr,$zone) = strptime +($cdate); $yr += 1900; $mon++; my $date_str = join('', ($yr, sprintf('%02d', +$mon), sprintf('%02d', $dy), $hh, $mm, $ss)); # print "Last myapp hit on $lf to $url by $ip +on $date_str \n"; #print "$mon/$dy/$yr $hh:$mm:$ss > $month/$mda +y/$year $hour:$min:$sec ?\n"; my ($domain, $t3, $t4) = split("_", $lf); if ($date_str > ($date_cmp_str - $session_limi +t)) { print "User logged in! $domain last hi +t $mon/$dy/$yr $hh:$mm:$ss by IP $ip\n"; } else { print "$domain last hit $mon/$dy/$yr $ +hh:$mm:$ss by IP $ip\n"; } } } }

Replies are listed 'Best First'.
Re: Emulating command line pipe
by BrowserUk (Patriarch) on Aug 31, 2007 at 14:43 UTC
      Thank you for your suggestion. This works fine in a script that gets executed at the command line, but when run as a cgi I see processes get split up, which prevents the -m 1 grep argument from killing tac:

      apache 2717 0.1 0.1 5112 3356 ? S 12:42 0:00 /usr/ +bin/perl /w3/mysite.com/myapp/cgi-bin/myapp_sessions.cgi apache 2920 0.0 0.0 2296 1024 ? S 12:42 0:00 sh -c + tac /var/log/httpd/my_big_access_log | grep -m 1 /myapp/ apache 2921 24.9 0.0 1572 408 ? R 12:42 0:07 tac / +var/log/httpd/my_big_access_log
      Here's my CGI version, would you have any other advice by chance?

      Thank You,
      Troy

      ---
      #!/usr/bin/perl use strict; use CGI; use Date::Parse; my $log_dir = '/var/log/httpd/'; my $session_limit = 3600; # 1 hour my ($sec,$min,$hour,$mday,$month,$year,$wday,$yday,$isdst) = localtime +(time); $year += 1900; $month++; my $date_cmp_str = join('', ($year, sprintf('%02d', $month), sprintf(' +%02d', $mday), $hour, $min, $sec)); my $cgiq = new CGI; print $cgiq->header(); print '<html><head><title>myapp user status</title><link rel="styleshe +et" href="/style/login.css" type="text/css" />'; print '<style type="text/css">li { font-size: 14px; }</style></head><b +ody><h1>myapp user status</h1>'; print "\n<h2>Last known user activity:</h2>\n<ul>\n"; my @expired; my @active; opendir(DIR, $log_dir); my @files = grep { /_access_log$/ } readdir(DIR); closedir DIR; print STDERR "Files = " . join(", ", @files) . "\n\n"; sleep(3); foreach my $lf (@files) { # make sure this is an myapp site my ($domain, $t1, $t2) = split("_", $lf); my $myapp_dir = '/w3/' . $domain . '/myapp/'; if (!-e $myapp_dir) { next; } else { # Find last login my $file = $log_dir . $lf; my $last_line = ''; my $pid = open IN, "tac $file | grep -m 1 /myapp/ |" or die $! +; while( <IN> ) { $last_line = $_; } close IN; if ($last_line eq '') { # print "No login in logs.\n"; } else { my ($ip, $d1, $d2, $date, $tz, $method, $url, $rest) = spl +it(" ", $last_line); $date = substr($date, 1); $tz = substr($tz, 0, -1); my $cdate = $date . ' ' . $tz; my ($ss,$mm,$hh,$dy,$mon,$yr,$zone) = strptime($cdate); $yr += 1900; $mon++; my $date_str = join('', ($yr, sprintf('%02d', $mon), sprin +tf('%02d', $dy), $hh, $mm, $ss)); my ($domain, $t3, $t4) = split("_", $lf); if ($date_str > ($date_cmp_str - $session_limit)) { print "<li><font color='red'>User logged in! <a href=' +http://$domain'>$domain</a> last hit $mon/$dy/$yr $hh:$mm:$ss by IP $ +ip</font></li>\n"; } else { print "<li>$domain last hit $mon/$dy/$yr $hh:$mm:$ss b +y IP $ip</li>\n"; } } } } print "</ul></body></html>";
Re: Emulating command line pipe
by dsheroh (Monsignor) on Aug 31, 2007 at 14:53 UTC
    I'm not familiar with File::ReadBackwards, but the core of your handling (and most likely location for a problem) is
    while( !$found && defined( my $log_line = $bw->readline ) ) { if ($log_line =~ /GET \/myapp\//) { $last_line = $log_line; $bw->close(); } }
    This probably works reasonably well, given that you would expect $bw->readline to return undef after being closed, but a quick check of its docs on CPAN didn't turn up anything actually saying that it will. You're also wastefully checking $found on each iteration, even though there is nothing in the code you posted which ever sets it, aside from its initialization to 0 when declared.

    To ensure that neither of these is causing problems, I would rewrite the loop as

    while(defined( my $log_line = $bw->readline ) ) { if ($log_line =~ /GET \/myapp\//) { $last_line = $log_line; last; } } $bw->close();
    This modification skips testing $found, explicitly exits the loop once it finds a match, and will always close $bw whether the file matches or not.

    Within the loop itself, you can probably improve performance a fair bit by replacing the regex match with a call to substr, since you're looking for a static string and don't need the overhead introduced by the regex engine.

    If it's still slow after all that, try setting up a second loop that does the same thing, but reading the file normally from start to end and storing only the last match. Comparing the time taken by the two versions could provide some clues as to where the problem is. (i.e., If they both take just as long, then File::ReadBackwards isn't doing its job very well.)

      Thank you, I see how this optimization should help. Unfortunatley it looks like File::ReadBackwards isn't very efficient. It reduced the page load time a bit, but not significantly.

      I also tried having a separate script that just runs the tac command, and having the cgi call that script. But the "sh -c" effect that apache applies cascades to that script too. :-(
Re: Emulating command line pipe
by shmem (Chancellor) on Aug 31, 2007 at 16:41 UTC
    Don't use system - if you just open a pipe from tac to yourself, you have full control.
    my $tac_pid = open my $fh, "tac $logfile |"; while (<$fh>) { last if /$match/; } kill 13, $tac_pid; # SIGPIPE close $fh;

    But even so tac might be reading more than it should needed, since there's buffering on both sides of the pipe (and yes, I guess closing the filehandle $fh sends a SIGPIPE to tac... :-)

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Thank you, shmem. Unfortunately once run as a cgi the process is no longer killed upon a match, I'm not sure why. I think this might be an issue with apache, but I'm not certain.

      Thank You,
      Troy
        AFAIK grep -m 1 exiting causes the OS to send a SIGPIPE to tac upon grep exit... hmm. Could it be that SIGPIPE is masked somewhere in mod_perl, ? is that masking propagated to the child? Is it even possible to mask a signal for a process group? Now that's something I don't know, you got me there :-)

        If that's the case - i.e. if SIGPIPE is masked - try my solution (which doesn't involve grep) and kill off tac with another signal (15 or, if it's renitent, 9 - SIGKILL).

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}