Earindil has asked for the wisdom of the Perl Monks concerning the following question:

My boss wants the following data graphed out as a stacked bar graph in 5 minute intervals. As you can see, the data is collected every 2 minutes. Somehow, I need to break it down into 5 minute chunks and average the numbers in those chunks. So the first two sets would end up like this:
	08/07/03.22:55 TOT/3 TOT/3 TOT/3 TOT/3 TOT/3
	08/07/03.23:00 TOT/2 TOT/2 TOT/2 TOT/2 TOT/2
Everything I've come up with/attempted has looked like a horrible mess/hack. Any suggestions on what I could do here would be highly appreciated.
.
.
.
08/07/03.22:55  1.029   1.172   1.03    0.086   0.382   
08/07/03.22:57  0.829   2.284   1.219   0.087   0.439   
08/07/03.22:59  2.437   0.792   0.809   0.087   0.305   

08/07/03.23:01  0.653   1.089   0.541   0.116   0.351   
08/07/03.23:03  0.823   2.407   0.826   0.04    0.23    

08/07/03.23:05  0.797   1.016   0.619   0.195   0.274   
08/07/03.23:07  1.742   0.901   1.078   0.087   0.328   
08/07/03.23:09  0.897   1.218   0.512   0.096   0.252   

08/07/03.23:11  1.146   1.281   0.521   0.086   0.276   
08/07/03.23:13  0.924   1.129   0.891   0.4     0.456   

08/07/03.23:15  1.103   1.383   1.645   0.09    0.387   
08/07/03.23:17  0.86    2.078   1.098   0.635   0.36    
08/07/03.23:19  3.832   1.911   0.808   0.086   0.309   
.
.
.
Here is the current code that is generating this data from the raw log file and not doing any 5 minute chunking.
open OUT, ">$Output_Directory/sitescope.dat" or die "Can't open OUTPUT file: $!";
foreach (@data) {
    @tmp = split(/\t/);
    ($time,$date) = split(/ /,$tmp[0]);
    $date =~ s/2003/03/g;
    $time = substr($time,0,5);
    $mday = substr($date,3,2);
    $hour = substr($time,0,2);
    print OUT "$date\.$time\t";
    foreach (@tmp[1..5]) {
        $seconds = $_/1000;
        print OUT "$seconds\t";
    }
    print OUT "\n";
}
close OUT;

Re: data manipulation
by japhy (Canon) on Aug 08, 2003 at 16:00 UTC
    Here's how I would approach it -- it cuts down on redundant code (doing X for three lines, and doing a very similar X for two lines):
    until (eof FILE) {
        for my $count (3, 2) {
            # read the next $count lines as one set
            my @set = map scalar <FILE>, 1 .. $count;
            my ($time, @totals);
            for (@set) {
                chomp;
                my @f = split /\t/;
                my $t = shift @f;   # always remove the timestamp field
                $time ||= $t;       # only want the FIRST time
                $totals[$_] += $f[$_] for 0 .. $#f;
            }
            print OUT join("\t", $time, map $_/$count, @totals), "\n";
        }
    }
    That's incomplete, as far as formatting the proper date and time, but it's a pretty good start.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Re: data manipulation
by BrowserUk (Patriarch) on Aug 08, 2003 at 18:58 UTC

    Accumulate the values into a hash where the keys are the 5 minute slots. Split every 3rd set of values between the 2 affected slots. Divide by 5 for the averages:

    #! perl -slw
    use strict;
    use Data::Dumper;

    my $re_line = qr[ ( \d\d ): ( \d\d ) ( .* $ ) ]x;

    my %timeslots;
    my $counter = 0;

    while( <DATA> =~ $re_line ) {
        # Extract hrs, mins, data into $1, $2, $3
        my @temp = split ' ', $3;
        for( 0 .. 4 ) {
            # split every 3rd line between 2 5-minute slots
            if( $counter++ % 3 == 2 ) {
                $timeslots{ int(     ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ] / 2;
                $timeslots{ int( 1 + ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ] / 2;
            }
            else {
                $timeslots{ int( ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ];
            }
        }
    }

    for my $ts ( sort{ $a <=> $b } keys %timeslots ) {
        printf '%2d:%02d' . ' [%f]' x 5 . $/,
            ( $ts * 5 ) / 60, ( $ts * 5 ) % 60,
            map{ $_ / 5 } @{ $timeslots{ $ts } } ;
    }

    __DATA__
    08/07/03.22:55 1.029 1.172 1.03 0.086 0.382
    08/07/03.22:57 0.829 2.284 1.219 0.087 0.439
    08/07/03.22:59 2.437 0.792 0.809 0.087 0.305
    08/07/03.23:01 0.653 1.089 0.541 0.116 0.351
    08/07/03.23:03 0.823 2.407 0.826 0.04 0.23
    08/07/03.23:05 0.797 1.016 0.619 0.195 0.274
    08/07/03.23:07 1.742 0.901 1.078 0.087 0.328
    08/07/03.23:09 0.897 1.218 0.512 0.096 0.252
    08/07/03.23:11 1.146 1.281 0.521 0.086 0.276
    08/07/03.23:13 0.924 1.129 0.891 0.4 0.456
    08/07/03.23:15 1.103 1.383 1.645 0.09 0.387
    08/07/03.23:17 0.86 2.078 1.098 0.635 0.36
    08/07/03.23:19 3.832 1.911 0.808 0.086 0.309

    Output

    P:\test>282224
    22:55 [0.776100] [0.770400] [0.508600] [0.043300] [0.194700]
    23:00 [0.295800] [0.778400] [0.322300] [0.035900] [0.146700]
    23:05 [0.679800] [0.525400] [0.388100] [0.070000] [0.143400]
    23:10 [0.503700] [0.455500] [0.301100] [0.106800] [0.146200]
    23:15 [1.048700] [0.994700] [0.718500] [0.153200] [0.202800]
    23:20 [0.110300] [0.207800] [0.080800] [0.009000] [0.036000]
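The hash keys in the code above come from integer-dividing minutes-since-midnight by 5. The following is a minimal sketch of just that arithmetic; `slot_of` and `slot_label` are hypothetical helper names that do not appear in the post, and the sketch assumes times always arrive as two-digit `HH:MM`.

```perl
use strict;
use warnings;

# Hypothetical helpers (not from the post) showing the slot-key arithmetic.

# Map an "HH:MM" time to its 5-minute slot number since midnight.
sub slot_of {
    my ($hhmm) = @_;
    my ( $h, $m ) = $hhmm =~ /^(\d\d):(\d\d)$/
        or die "bad time '$hhmm'";
    return int( ( $h * 60 + $m ) / 5 );
}

# Turn a slot number back into the "HH:MM" label of the slot's start.
sub slot_label {
    my ($slot) = @_;
    my $mins = $slot * 5;
    return sprintf '%02d:%02d', int( $mins / 60 ), $mins % 60;
}

print "$_ -> ", slot_label( slot_of($_) ), "\n"
    for qw( 22:55 22:57 22:59 23:01 23:03 );
```

With these helpers, 22:55, 22:57 and 22:59 all map to the slot labelled 22:55, while 23:01 and 23:03 map to 23:00.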

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: data manipulation
by cfreak (Chaplain) on Aug 08, 2003 at 16:02 UTC

    You might have a look at GD::Graph to produce a bar graph for you. As for five minute chunks, since you don't have any actual logging every five minutes it seems like you would always be grabbing every other entry, so why not just grab every other entry? I mean you can still say it's 5 minutes because you wouldn't have any data to get it any closer.

    Hope that helps

    Lobster Aliens Are attacking the world!
Re: data manipulation
by Earindil (Beadle) on Aug 08, 2003 at 16:23 UTC
    Here's what I worked out. It does actually work even though it looks like crap and I'm sure one of you can probably cut out half of the code. Please feel free to show me a better way to do this.
    open OUT, ">$Output_Directory/sitescope.dat" or die "Can't open OUTPUT file: $!";
    foreach (@data) {
        $counter++;
        @tmp = split(/\t/);
        ($time,$date) = split(/ /,$tmp[0]);
        $date =~ s/2003/03/g;
        $time = substr($time,0,5);
        $mday = substr($date,3,2);
        ($hour,$min) = split(/:/,$time);
        $mod = int($min/5);
        if ($mod != $last_mod) {
            print OUT "$last_date\.$last_time\t";
            foreach (1..5) {
                $seconds = $seconds{$_}/$counter;
                printf OUT "%1.2f\t",$seconds;
                $seconds{$_}=0;
            }
            print OUT "\n";
            $counter=0;
        }
        foreach (@tmp[1..5]) {
            $place++;
            $seconds{$place} += ($_/1000);
        }
        $place=0;
        $last_mod = $mod;
        $last_date = $date;
        $last_time = $time;
    }
    close OUT;
    Original data:
    08/08/03.07:55  1.61    1.158   0.71    0.209   0.535   
    08/08/03.07:57  1.094   0.887   0.96    0.126   0.322   
    08/08/03.07:59  5.131   0.986   1.029   0.095   0.251   
    08/08/03.08:01  1.071   1.274   0.638   0.197   0.347   
    08/08/03.08:03  1.166   0.984   0.598   0.092   0.583   
    
    Chunked data:
    
    08/08/03.07:59  2.39    0.94    0.80    0.11    0.30    
    08/08/03.08:03  2.31    1.60    1.02    0.20    0.61    
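One alternative way to do the chunking is to key a hash on each row's 5-minute slot and divide each sum by the number of rows that fell into that slot. The sketch below is not the poster's code: `chunk_rows` is a hypothetical helper, the input is assumed to be already in the tab-separated `date.time` layout shown above, and each chunk is labelled with the start of its slot, as the original question asked.

```perl
use strict;
use warnings;

# Sample rows in the "date.time<TAB>values" layout shown above.
my @rows = (
    "08/08/03.07:55\t1.61\t1.158\t0.71\t0.209\t0.535",
    "08/08/03.07:57\t1.094\t0.887\t0.96\t0.126\t0.322",
    "08/08/03.07:59\t5.131\t0.986\t1.029\t0.095\t0.251",
    "08/08/03.08:01\t1.071\t1.274\t0.638\t0.197\t0.347",
    "08/08/03.08:03\t1.166\t0.984\t0.598\t0.092\t0.583",
);

# Hypothetical helper: average the rows falling in each 5-minute slot.
# Returns a list of [ label, avg1, avg2, ... ] in first-seen order.
sub chunk_rows {
    my @lines = @_;
    my ( %sum, %count, @order );
    for my $line (@lines) {
        my ( $stamp, @vals ) = split /\t/, $line;
        my ( $date, $h, $m ) = $stamp =~ m{^(\S+)\.(\d\d):(\d\d)$}
            or next;    # skip lines that do not parse
        # Round the minutes down to the start of the 5-minute slot.
        my $label = sprintf '%s.%02d:%02d', $date, $h, $m - $m % 5;
        push @order, $label unless $count{$label}++;
        $sum{$label}[$_] += $vals[$_] for 0 .. $#vals;
    }
    return map {
        my $label = $_;
        [ $label, map { $_ / $count{$label} } @{ $sum{$label} } ];
    } @order;
}

for my $chunk ( chunk_rows(@rows) ) {
    my ( $label, @avg ) = @$chunk;
    print join( "\t", $label, map { sprintf '%.2f', $_ } @avg ), "\n";
}
```

Because the divisor is the per-slot row count, a slot holding three samples and a slot holding two are both averaged correctly without tracking which case applies.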
    
    

      Among other problems:

      ($time,$date) = split(/ /,$tmp[0]);

      I don't think this line is doing what you think it's doing. The contents of @tmp are something like

      (08/08/03.08:01, 1.071, 1.274, 0.638, 0.197, 0.347)

      ...and you are trying to split the first element in the array, 08/08/03.08:01, on a non-existent space. This means that $time contains the whole string, while $date is empty!

      I think you need something like this:

      my ( $date, $time ) = split( /\./, $tmp[0] );

      And note that I've changed the order of your scalar variables around!
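The difference between the two calls can be checked with a short, self-contained sketch. The variable names `$first` and `$second` are illustrative only, and the sample string is the one quoted above.

```perl
use strict;
use warnings;

my $stamp = '08/08/03.08:01';

# Splitting on a space that is not there returns the whole string as the
# only field, so the second variable is left undefined.
my ( $first, $second ) = split / /, $stamp;
print "first:  $first\n";
print "second: ", ( defined $second ? $second : 'undef' ), "\n";

# Splitting on a literal dot separates the date from the time.
my ( $date, $time ) = split /\./, $stamp;
print "date: $date, time: $time\n";
```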

      My best advice to you is:

      use strict; use warnings;
      dave
        ($time,$date) = split(/ /,$tmp[0]);
        This is actually coming from the raw data file before any manipulation which looks like this:
        10:40:32 08/08/2003 good website servername 8.32 sec, 9 steps, 241K total, 35 images 16:72472 200 8328 ok 1557 662 1275 582 517 1129 1053 962 591

        I convert the time/date to the new format for use with Ploticus (an open source graphing tool which I love), which likes to see a period between the two.
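The conversion described here can be sketched with a single regex instead of the year-specific `s/2003/03/`. This is an illustration under assumptions: `ploticus_stamp` is a hypothetical helper name, and the first field of the raw log is assumed to always look like `HH:MM:SS NN/NN/YYYY`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "HH:MM:SS NN/NN/YYYY" into "NN/NN/YY.HH:MM".
# The two leading date fields are passed through unchanged.
sub ploticus_stamp {
    my ($raw) = @_;
    my ( $h, $m, $date_prefix, $yy ) =
        $raw =~ m{^(\d\d):(\d\d):\d\d (\d\d/\d\d)/\d\d(\d\d)$}
        or return;    # unrecognised layout
    return sprintf '%s/%s.%s:%s', $date_prefix, $yy, $h, $m;
}

print ploticus_stamp('10:40:32 08/08/2003'), "\n";
```

For the raw line quoted above this prints `08/08/03.10:40`, and it keeps working when the year changes.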

        Edit by tye, change PRE to CODE around long lines