xorl has asked for the wisdom of the Perl Monks concerning the following question:

I'm crunching some web logs. However, we've got two boxes (one an IIS server and the other an Apache server). In order for our analysis tool to give us info, the logs have to be combined and sorted.

So right now I've got a perl script which pulls the logs from each box, combines them into one giant log, and puts that in the watch directory of the analysis software.

The problem is we only do this once a month and the combining step takes a while. I'd like to let the user know that something is actually happening and give them an idea of the progress. Here's the section of the code:

    my $cmd = "cat /tmp/win.log /tmp/lnx.log | grep -v \"/" . $nextmonth . "/\""
            . " | sort -k4.2 > /export/w3logs/" . $year . $month . ".log";
    system($cmd);

The only way I can think of to make a progress bar for this is to pull it all into Perl. The basic outline would be something like:

    open the log files and the output file.
    read each log file into a hash, with the key based on the date of the item.
    sort the hash keys and write the values to the output file.

Now, from previous experience I know the above outline will be a lot slower than how it is being done now. Also note that we're talking about 2-4 GB log files.

Is there a faster way of doing this, or is there some way to make a progress bar and still use the system call?

Replies are listed 'Best First'.
Re: progress bar for a system command
by zentara (Cardinal) on Jul 19, 2007 at 16:12 UTC
    I think it depends on whether you want a console progressbar, or something requiring windows, like this example from BrowserUk. The nice thing about Tk is it will run on XWindows or MSWindows. Just put your merge code into the work sub, and figure out a way to compute $progress. You might want to make the progressbar wider for a 2GB merge. :-)
    #!/usr/bin/perl
    use strict;
    use threads qw[ async ];
    use threads::shared;

    our $WORKMAX ||= 1_000;

    # by BrowserUk
    ## A shared var to communicate progress between work thread and TK
    my $progress : shared = 0;

    sub work {
        for my $item ( 0 .. $WORKMAX ) {
            {
                lock $progress;
                $progress = ( $item / $WORKMAX ) * 100;
            }
            select undef, undef, undef, 0.01;    ## do stuff that takes time
        }
    }

    threads->new( \&work )->detach;

    ## For lowest memory consumption require (not use)
    ## Tk::* after you've started the work thread.
    require Tk::ProgressBar;

    my $mw = MainWindow->new;
    my $pb = $mw->ProgressBar()->pack();

    my $repeat;
    $repeat = $mw->repeat(
        100 => sub {
            print $progress;
            $repeat->cancel if $progress == 100;
            $pb->value($progress);
        }
    );

    $mw->MainLoop;

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: progress bar for a system command
by Fletch (Bishop) on Jul 19, 2007 at 15:36 UTC

    If you just want to biff the user and let them know you haven't died, perhaps fork off the sort pipeline into another process and then have the parent do a waitpid inside an alarm-ified loop (see perlipc for more examples) that prints a "." or what not. When the wait returns after the child sort is done the parent can move on.
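
A minimal sketch of the fork-and-wait idea, assuming a Unix-like box with /bin/sh. It polls with a non-blocking waitpid (WNOHANG) and a short sleep in place of an alarm handler, which is a swapped-in variation and not necessarily what was meant above. The name run_with_ticker and the placeholder command are hypothetical; the real cat | grep | sort pipeline would go in its place.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Run a shell command in a child process and print a dot every
# $interval seconds until it finishes. Returns the child's exit code.
# (run_with_ticker is a hypothetical name, not from the thread.)
sub run_with_ticker {
    my ($cmd, $interval) = @_;
    $interval ||= 1;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: hand the whole pipeline to the shell.
        exec('/bin/sh', '-c', $cmd) or die "exec failed: $!";
    }

    # Parent: poll until the child has been reaped.
    local $| = 1;    # unbuffered output so the dots show up immediately
    while (waitpid($pid, WNOHANG) == 0) {
        print ".";
        select(undef, undef, undef, $interval);    # sleep; fractions allowed
    }
    my $exit_code = $? >> 8;
    print "\n";
    return $exit_code;
}

# Placeholder command standing in for the real log pipeline.
my $exit_code = run_with_ticker('sleep 1', 0.25);
print "done, exit code $exit_code\n";
```

This only tells the user the job is still alive; it says nothing about how far along the sort is.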

Re: progress bar for a system command
by duelafn (Parson) on Jul 19, 2007 at 15:50 UTC

    Since each individual log is sorted already you should not need to do so much work. The following outline should be sufficient:

    open the log files and the output file.
    read first line of each log file.
    write oldest line to output file and read next line from corresponding input file.
    repeat.
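
The outline above can be sketched as a two-way merge. This assumes both logs are already sorted and share one format with the timestamp in the fourth whitespace-separated field, roughly mirroring the key the original `sort -k4.2` starts from. The names sort_key and merge_sorted and the sample lines are hypothetical, and a plain string comparison of such timestamps is only chronological within a single month.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical key extractor: the 4th whitespace-separated field
# without its leading "[" (e.g. "01/Jul/2007:00:00:01").
sub sort_key {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return defined $fields[3] ? substr($fields[3], 1) : '';
}

# Merge two already-sorted input handles into $out, oldest line first.
sub merge_sorted {
    my ($fh_a, $fh_b, $out) = @_;
    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;

    while (defined $line_a && defined $line_b) {
        if (sort_key($line_a) le sort_key($line_b)) {
            print {$out} $line_a;
            $line_a = <$fh_a>;
        }
        else {
            print {$out} $line_b;
            $line_b = <$fh_b>;
        }
    }

    # One input is exhausted; copy whatever is left of the other.
    while (defined $line_a) { print {$out} $line_a; $line_a = <$fh_a>; }
    while (defined $line_b) { print {$out} $line_b; $line_b = <$fh_b>; }
    return;
}

# Demo on in-memory "files" with made-up log lines.
my $win = qq{1.1.1.1 - - [01/Jul/2007:00:00:01 -0400] "GET /a"\n}
        . qq{1.1.1.1 - - [01/Jul/2007:00:00:05 -0400] "GET /c"\n};
my $lnx = qq{2.2.2.2 - - [01/Jul/2007:00:00:03 -0400] "GET /b"\n};

open my $in_win, '<', \$win or die "open: $!";
open my $in_lnx, '<', \$lnx or die "open: $!";
merge_sorted($in_win, $in_lnx, \*STDOUT);
```

Only one line per input is held in memory at a time, so the size of the logs should not matter much.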

    update: See Also File::MergeSort

    Good Day,
        Dean

Re: progress bar for a system command
by dsheroh (Monsignor) on Jul 19, 2007 at 17:13 UTC
    The quickest way to accomplish this is to avoid having to do it in the first place... Can your servers be configured to both send data to a single centralized log? I know *nix's syslog is network-aware and suspect something similar should exist for Apache and IIS, simply because companies with huge webserver farms aren't going to want to deal with each server's log individually.

    If you're not able to centralize it, then duelafn's suggestion is probably going to be the quickest way to combine them. It also allows you to use tell on each filehandle before or after reading from it to monitor how far you've gotten in processing that file. Combined with checking the size of each file before starting, that should give you enough information to generate a relatively meaningful progress indicator.
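
A rough sketch of the tell-based progress calculation described above, assuming plain files on disk. percent_done is a hypothetical helper name, and the input paths are taken from the command line purely for illustration; the actual merge-and-write step is left as a comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Percentage of the total input bytes consumed so far, summed over
# all open input handles. (percent_done is a hypothetical helper.)
sub percent_done {
    my ($total_bytes, @handles) = @_;
    return 100 if $total_bytes <= 0;
    my $bytes_read = 0;
    $bytes_read += tell($_) for @handles;
    return 100 * $bytes_read / $total_bytes;
}

# Input paths come from the command line, e.g. /tmp/win.log /tmp/lnx.log
my ($total, @fhs) = (0);
for my $path (@ARGV) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    $total += -s $fh;    # file size, checked once before starting
    push @fhs, $fh;
}

$| = 1;                  # unbuffered output so the counter updates in place
my ($lines, $last_pct) = (0, -1);
for my $fh (@fhs) {
    while (my $line = <$fh>) {
        # ... merge and write $line here ...
        next if ++$lines % 1000;    # only check progress every 1000 lines
        my $pct = int percent_done($total, @fhs);
        if ($pct != $last_pct) {
            printf "\r%3d%%", $pct;
            $last_pct = $pct;
        }
    }
}
print "\n";
```

Checking only every thousand lines keeps the bookkeeping cheap next to the I/O; the interval is arbitrary.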