data manipulation

Earindil has asked for the wisdom of the Perl Monks concerning the following question:

My boss wants the following data graphed out as a stacked bar graph in 5 minute intervals. As you can see, the data is collected every 2 minutes. Somehow, I need to break it down into 5 minute chunks and average the numbers in those chunks. So the first two sets would end up like this:

	08/07/03.22:55 TOT/3 TOT/3 TOT/3 TOT/3 TOT/3
	08/07/03.23:00 TOT/2 TOT/2 TOT/2 TOT/2 TOT/2

Everything I've come up with/attempted has looked like a horrible mess/hack. Any suggestions on what I could do here would be highly appreciated.

.
.
.
08/07/03.22:55  1.029   1.172   1.03    0.086   0.382   
08/07/03.22:57  0.829   2.284   1.219   0.087   0.439   
08/07/03.22:59  2.437   0.792   0.809   0.087   0.305   

08/07/03.23:01  0.653   1.089   0.541   0.116   0.351   
08/07/03.23:03  0.823   2.407   0.826   0.04    0.23    

08/07/03.23:05  0.797   1.016   0.619   0.195   0.274   
08/07/03.23:07  1.742   0.901   1.078   0.087   0.328   
08/07/03.23:09  0.897   1.218   0.512   0.096   0.252   

08/07/03.23:11  1.146   1.281   0.521   0.086   0.276   
08/07/03.23:13  0.924   1.129   0.891   0.4     0.456   

08/07/03.23:15  1.103   1.383   1.645   0.09    0.387   
08/07/03.23:17  0.86    2.078   1.098   0.635   0.36    
08/07/03.23:19  3.832   1.911   0.808   0.086   0.309   
.
.
.

Here is the current code that is generating this data from the raw log file and not doing any 5 minute chunking.

open OUT, ">$Output_Directory/sitescope.dat" or die "Can't open OUTPUT file: $!";
foreach (@data) {
        @tmp = split(/\t/);
        ($time,$date) = split(/ /,$tmp[0]);
        $date =~ s/2003/03/g;
        $time = substr($time,0,5);
        $mday = substr($date,3,2);
        $hour = substr($time,0,2);
        print OUT "$date\.$time\t";
        foreach (@tmp[1..5]) {
                $seconds = $_/1000;
                print OUT "$seconds\t";
        }
        print OUT "\n";
}
close OUT;

Re: data manipulation by japhy (Canon) on Aug 08, 2003 at 16:00 UTC
Here's how I would approach it. It cuts down on redundant code (doing X for three lines, and doing a very similar X for two lines):

until (eof FILE) {
  for my $count (3, 2) {
    my @set = map scalar <FILE>, 1 .. $count;
    my ($time, @totals);
    for (@set) {
      chomp;
      my @f = split /\t/;
      my $stamp = shift @f;   # always remove the timestamp column
      $time ||= $stamp;       # only want the FIRST time
      $totals[$_] += $f[$_] for 0 .. $#f;
    }
    print OUT join("\t", $time, map $_/$count, @totals), "\n";
  }
}

That's incomplete, as far as formatting the proper date and time, but it's a pretty good start.

_____________________________________________________
Jeff`[japhy]`Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
`s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;`
Re: data manipulation by BrowserUk (Patriarch) on Aug 08, 2003 at 18:58 UTC
Accumulate the values into a hash where the keys are the 5 minute slots. Split every 3rd set of values between the 2 affected slots. Divide by 5 for the averages:

#! perl -slw
use strict;
use Data::Dumper;

my $re_line = qr[ ( \d\d ): ( \d\d ) ( .* $ ) ]x;

my %timeslots;
my $counter = 0;

while( <DATA> =~ $re_line ) { # Extract hrs, mins, data into $1, $2, $3
    my @temp = split ' ', $3;
    for( 0 .. 4 ) {
        # split every 3rd line between 2 5-minute slots
        if( $counter++ % 3 == 2 ) {
            $timeslots{ int(     ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ] / 2;
            $timeslots{ int( 1 + ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ] / 2;
        }
        else {
            $timeslots{ int( ( $1 * 60 + $2 ) / 5 ) }->[ $_ ] += $temp[ $_ ];
        }
    }
}

for my $ts ( sort{ $a <=> $b } keys %timeslots ) {
    printf '%2d:%02d' . ' [%f]' x 5 . $/,
        ( $ts * 5 ) / 60, ( $ts * 5 ) % 60,
        map{ $_ / 5 } @{ $timeslots{ $ts } } ;
}

__DATA__
08/07/03.22:55 1.029 1.172 1.03 0.086 0.382
08/07/03.22:57 0.829 2.284 1.219 0.087 0.439
08/07/03.22:59 2.437 0.792 0.809 0.087 0.305
08/07/03.23:01 0.653 1.089 0.541 0.116 0.351
08/07/03.23:03 0.823 2.407 0.826 0.04 0.23
08/07/03.23:05 0.797 1.016 0.619 0.195 0.274
08/07/03.23:07 1.742 0.901 1.078 0.087 0.328
08/07/03.23:09 0.897 1.218 0.512 0.096 0.252
08/07/03.23:11 1.146 1.281 0.521 0.086 0.276
08/07/03.23:13 0.924 1.129 0.891 0.4 0.456
08/07/03.23:15 1.103 1.383 1.645 0.09 0.387
08/07/03.23:17 0.86 2.078 1.098 0.635 0.36
08/07/03.23:19 3.832 1.911 0.808 0.086 0.309

Output

P:\test>282224
22:55 [0.776100] [0.770400] [0.508600] [0.043300] [0.194700]
23:00 [0.295800] [0.778400] [0.322300] [0.035900] [0.146700]
23:05 [0.679800] [0.525400] [0.388100] [0.070000] [0.143400]
23:10 [0.503700] [0.455500] [0.301100] [0.106800] [0.146200]
23:15 [1.048700] [0.994700] [0.718500] [0.153200] [0.202800]
23:20 [0.110300] [0.207800] [0.080800] [0.009000] [0.036000]

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.
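
The slot idea above can also be sketched without a line counter. As posted, `$counter++` sits inside the `for( 0 .. 4 )` loop, so it appears to advance once per column rather than once per line. The sketch below instead weights each sample by the minutes it overlaps with each slot. It is only an illustration under stated assumptions: the helper name `slot_averages` is hypothetical, each sample is assumed to cover the two minutes starting at its timestamp, and the data is assumed not to cross midnight.

```perl
use strict;
use warnings;

# Hypothetical helper: time-weighted mean per 5-minute slot.
# Assumption: a line stamped HH:MM covers minutes HH:MM and HH:MM+1.
sub slot_averages {
    my @lines = @_;
    my ( %sum, %minutes );
    for my $line (@lines) {
        my ( $hh, $mm, $rest ) = $line =~ /(\d\d):(\d\d)\s+(.*)/ or next;
        my @vals  = split ' ', $rest;
        my $start = $hh * 60 + $mm;          # minutes since midnight
        for my $minute ( $start, $start + 1 ) {
            my $slot = int( $minute / 5 );   # slot index, as in the reply above
            $minutes{$slot}++;               # minutes of coverage in this slot
            $sum{$slot}[$_] += $vals[$_] for 0 .. $#vals;
        }
    }
    # One row per slot: [ "HH:MM", mean1, mean2, ... ]
    return map {
        my $slot = $_;
        [ sprintf( '%02d:%02d', int( $slot * 5 / 60 ), ( $slot * 5 ) % 60 ),
          map { $_ / $minutes{$slot} } @{ $sum{$slot} } ];
    } sort { $a <=> $b } keys %sum;
}

my @rows = slot_averages(
    '08/07/03.22:55 1.029 1.172 1.03 0.086 0.382',
    '08/07/03.22:57 0.829 2.284 1.219 0.087 0.439',
    '08/07/03.22:59 2.437 0.792 0.809 0.087 0.305',
    '08/07/03.23:01 0.653 1.089 0.541 0.116 0.351',
    '08/07/03.23:03 0.823 2.407 0.826 0.04 0.23',
);
printf "%s" . " %.3f" x ( @$_ - 1 ) . "\n", @$_ for @rows;
```

A sample that straddles a boundary (22:59 here) contributes one minute to each neighbouring slot, and dividing by the minutes actually covered keeps a partially filled last slot from being under-reported.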
Re: data manipulation by cfreak (Chaplain) on Aug 08, 2003 at 16:02 UTC
You might have a look at GD::Graph to produce a bar graph for you. As for five minute chunks, since you don't have any actual logging every five minutes, it seems like you would always be grabbing every other entry, so why not just grab every other entry? I mean, you can still say it's 5 minutes because you wouldn't have any data to get it any closer.

Hope that helps

Lobster Aliens Are attacking the world!
Re: data manipulation by Earindil (Beadle) on Aug 08, 2003 at 16:23 UTC
Here's what I worked out. It does actually work even though it looks like crap and I'm sure one of you can probably cut out half of the code. Please feel free to show me a better way to do this.

open OUT, ">$Output_Directory/sitescope.dat" or die "Can't open OUTPUT file: $!";
foreach (@data) {
        $counter++;
        @tmp = split(/\t/);
        ($time,$date) = split(/ /,$tmp[0]);
        $date =~ s/2003/03/g;
        $time = substr($time,0,5);
        $mday = substr($date,3,2);
        ($hour,$min) = split(/:/,$time);
        $mod = int($min/5);
        if ($mod != $last_mod) {
                print OUT "$last_date\.$last_time\t";
                foreach (1..5) {
                        $seconds = $seconds{$_}/$counter;
                        printf OUT "%1.2f\t",$seconds;
                        $seconds{$_}=0;
                }
                print OUT "\n";
                $counter=0;
        }
        foreach (@tmp[1..5]) {
                $place++;
                $seconds{$place} += ($_/1000);
        }
        $place=0;
        $last_mod = $mod;
        $last_date = $date;
        $last_time = $time;
}
close OUT;

Original data:

08/08/03.07:55 1.61 1.158 0.71 0.209 0.535
08/08/03.07:57 1.094 0.887 0.96 0.126 0.322
08/08/03.07:59 5.131 0.986 1.029 0.095 0.251
08/08/03.08:01 1.071 1.274 0.638 0.197 0.347
08/08/03.08:03 1.166 0.984 0.598 0.092 0.583

Chunked data:

08/08/03.07:59 2.39 0.94 0.80 0.11 0.30
08/08/03.08:03 2.31 1.60 1.02 0.20 0.61
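
For comparison, the same bucketing by `int($min/5)` can be sketched under `use strict` with hashes keyed by bucket, which also writes out the final bucket (the loop above only prints a bucket when the next one starts). This is a sketch, not a drop-in replacement: the helper name `bucket_means` is hypothetical, it assumes already-converted lines of the form `MM/DD/YY.HH:MM v1 v2 ...` in time order, and it labels each bucket by its start time rather than by the last timestamp seen.

```perl
use strict;
use warnings;

# Hypothetical helper: plain (unweighted) mean of every sample whose
# timestamp falls in the same 5-minute bucket.
sub bucket_means {
    my @lines = @_;
    my ( @order, %sum, %count );
    for my $line (@lines) {
        my ( $stamp, @vals ) = split ' ', $line;
        next unless defined $stamp;          # skip blank lines
        my ( $date, $hh, $mm ) = $stamp =~ m{^(.+)\.(\d\d):(\d\d)$} or next;
        # Bucket label: minutes rounded down to a multiple of 5.
        my $key = sprintf( '%s.%s:%02d', $date, $hh, 5 * int( $mm / 5 ) );
        push @order, $key unless exists $count{$key};
        $count{$key}++;
        $sum{$key}[$_] += $vals[$_] for 0 .. $#vals;
    }
    # One tab-separated output line per bucket, in first-seen order.
    return map {
        my $key = $_;
        join( "\t", $key,
            map { sprintf( '%.2f', $_ / $count{$key} ) } @{ $sum{$key} } );
    } @order;
}

print "$_\n" for bucket_means(
    '08/08/03.07:55 1.61 1.158 0.71 0.209 0.535',
    '08/08/03.07:57 1.094 0.887 0.96 0.126 0.322',
    '08/08/03.07:59 5.131 0.986 1.029 0.095 0.251',
    '08/08/03.08:01 1.071 1.274 0.638 0.197 0.347',
    '08/08/03.08:03 1.166 0.984 0.598 0.092 0.583',
);
```

Because the divisor is the number of samples actually seen in each bucket, buckets with three samples and buckets with two are both averaged correctly without tracking a running counter.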
Re: Re: data manipulation by Not_a_Number (Prior) on Aug 08, 2003 at 16:58 UTC
Among other problems: `($time,$date) = split(/ /,$tmp[0]);`

I don't think this line is doing what you think it's doing. The contents of @tmp are something like `(08/08/03.08:01, 1.071, 1.274, 0.638, 0.197, 0.347)` ...and you are trying to split the first element in the array, `08/08/03.08:01`, on a non-existent space. This means that `$time` contains the whole string, while `$date` is empty!

I think you need something like this: `my ( $date, $time ) = split( /\./, $tmp[0] );` And note that I've changed the order of your scalar variables around!

My best advice to you is: `use strict; use warnings;`

dave
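
A minimal way to check this point, assuming the field really looks like `08/08/03.08:01` (the variable names below are illustrative only):

```perl
use strict;
use warnings;

my $stamp = '08/08/03.08:01';

# There is no space in the string, so split returns it whole as the only
# element and the second variable is left undefined.
my ( $first, $second ) = split / /, $stamp;
print defined $second
    ? "second: $second\n"
    : "second is undef; first holds '$first'\n";

# Splitting on a literal period separates the date from the time.
my ( $date, $time ) = split /\./, $stamp;
print "date=$date time=$time\n";
```

Under `use warnings`, using the undefined second variable in a string or substitution would also produce an "uninitialized value" warning, which is one way such a mix-up gets noticed early.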
Re: Re: Re: data manipulation by Earindil (Beadle) on Aug 08, 2003 at 17:43 UTC
`($time,$date) = split(/ /,$tmp[0]);`

This is actually coming from the raw data file before any manipulation, which looks like this:

`10:40:32 08/08/2003 good website servername 8.32 sec, 9 steps, 241K total, 35 images 16:72472 200 8328 ok 1557 662 1275 582 517 1129 1053 962 591`

I convert the time/date to the new format for use with Ploticus (an open-source graphing tool which I love), which likes to see a period between the two.

Edit by tye, change PRE to CODE around long lines
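
The conversion described here can be sketched with a single pattern match. The helper name `ploticus_stamp` is hypothetical, and the sketch assumes every raw line starts with `HH:MM:SS`, whitespace, then a date with a four-digit year. It drops the century digits rather than substituting the literal `2003`, so it does not depend on the year.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a leading "HH:MM:SS nn/nn/YYYY" into
# "nn/nn/YY.HH:MM". The two leading date fields keep their original order.
# Returns an empty list if the line does not start with that layout.
sub ploticus_stamp {
    my ($line) = @_;
    my ( $hhmm, $date_head, $yy ) =
        $line =~ m{^(\d\d:\d\d):\d\d\s+(\d\d/\d\d/)\d\d(\d\d)}
        or return;
    return "$date_head$yy.$hhmm";
}

print ploticus_stamp('10:40:32 08/08/2003 good website servername'), "\n";
```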