Working out frequency statistics with perl

wishartz has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am trying to plot a frequency chart from a log file that records how long it takes for a file to be brought back from tape and how large that file is. For example, the bin size for the number of seconds it takes to bring a file back from tape is 30 seconds, going up in increments of 30 seconds, and the output looks like this:

seconds  No of files
30:      72
60:      93
90:      75
120:     26
150:     18
180:     10
210:     5
240:     2
270:     1
300:     1
330:     2
360:     1
390:     1
450:     1
840:     1

And for the size of the files, the bins go up in increments of 25000000 kilobytes:

Size of file   No of files
25000000:      48
50000000:      59
75000000:      30
100000000:     21
125000000:     69
150000000:     13
175000000:     4
200000000:     3
250000000:     1
350000000:     6
425000000:     13
450000000:     3
550000000:     2
675000000:     1
1075000000:    1

The problem is that when I check the values it gives me, they seem incorrect. I cannot really supply the log file, but I just wanted to know, from looking at the code, whether this is the correct way to work out the frequency.


$bin_stage = 30;   #bin size in number of seconds
$bin_filesize = 25000000; #bin size in kb

open(READMAP, "$command  |") || error_exit("Cannot run readmap, $!"); #run the readmap command

while(<READMAP>) {    #loop through the output of the readmap command

        if (/StageTime/){

        @fields = split/\s+/;    #Split the output by white space

        chop $fields[9];    # Remove the period after the last digit

        $diskxStats[$i]{'filesize'}=$fields[2];

        $diskxStats[$i]{'ftptime'}=$fields[5];

        $diskxStats[$i]{'stagetime'}=$fields[9];

        if ( $fields[5] != 0 ){

        $diskxStats[$i]{'transferRate'} = $fields[2] / $fields[5];

        }

        else{

        $diskxStats[$i]{'transferRate'} = $fields[2];

        }

        if ( $fields[9] != 0 ){

        $diskxStats[$i]{'stagerate'} = $fields[2] / $fields[9];

        }

        else{

        $diskxStats[$i]{'stagerate'} = $fields[2];

        }

        $i++;

        }

}




my $j;



## Build the hash for frequency of stagetime########################################################

my $countstage = 0;

for  ( $countstage = $bin_stage; $countstage <= 3600; $countstage+=$bin_stage){

        $nextbin = $countstage + $bin_stage;

 for $i ( 0 .. $#diskxStats ) {

        if ( $diskxStats[$i]{'stagetime'} >=$countstage && $diskxStats[$i]{'stagetime'} < $nextbin){

                $frequency_stage{$countstage}{'stage_counter'}++;

        }

     }

}




##Build the hash for frequency of filesizes##########################################################

my $file_counter= 0;

my $countfile = 0;

$nextbin = 0;

my $array_index = 0;

for  ( $countfile = $bin_filesize; $countfile <= 1125000000; $countfile+=$bin_filesize){

        $nextbin = $countfile + $bin_filesize;

 for $array_index ( 0 .. $#diskxStats ) {

        if ( $diskxStats[$array_index]{'filesize'} >=$countfile && $diskxStats[$array_index]{'filesize'} < $nextbin){

                $frequency_files{$countfile}{'filesize'}++;

        }

     }

}



my $stage_counter=0;

my @sorted_stage = sort { $frequency_stage{$a} cmp $frequency_stage{$b} } keys %frequency_stage;

foreach $i (@sorted_stage) {

    print "$i: ";

    foreach $stage_counter ( keys %{ $frequency_stage{$i} } ) {

         print "$frequency_stage{$i}{$stage_counter}\n";

    }

}



my @sorted_filesizes = sort { $frequency_files{$a} cmp $frequency_files{$b} } keys %frequency_files;

foreach $i (@sorted_filesizes) {

    print "$i: ";

    foreach $file_counter ( keys %{ $frequency_files{$i} } ) {

         print "$frequency_files{$i}{$file_counter}\n";

    }

}

Thanks


Replies are listed 'Best First'.
Re: Working out frequency statistics with perl by moritz (Cardinal) on Jul 15, 2008 at 14:27 UTC
I don't know if your code is correct or not, but it's certainly more complex than it needs to be. Here's how I'd build the frequency statistics:

use POSIX qw(floor);

sub build_histogram {
    my ($bucket_size, @items) = @_;
    my %result;
    for (@items) {
        my $bucket = $bucket_size * floor($_ / $bucket_size);
        $result{$bucket}++;
    }
    return %result;
}

This sorts everything from 0 to `$bucket_size` (exclusive) into the first bucket, labeled 0, and so on. If you want it to be sorted into a bucket labeled `$bucket_size` instead, use `ceil()` instead of `floor()`.

Anyway, it's hard to tell if your code is correct or not. Try using Data::Dumper to check whether the intermediate data structures are correct, and perhaps even do it manually for a short excerpt of the log files and compare the results.
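A minimal sketch of how the subroutine above could be called, assuming the values to be binned have already been collected into a flat list. The stage times in `@stage_times` are made up for illustration and are not taken from the log; the printing loop is likewise an assumption about the desired output format.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Same subroutine as in the reply above.
sub build_histogram {
    my ($bucket_size, @items) = @_;
    my %result;
    for (@items) {
        my $bucket = $bucket_size * floor($_ / $bucket_size);
        $result{$bucket}++;
    }
    return %result;
}

# Hypothetical stage times in seconds (made-up values, not from the log).
my @stage_times = (5, 29, 30, 31, 59, 60, 125);

my %hist = build_histogram(30, @stage_times);

# Sort the bucket labels numerically; a string sort would put 120 before 30.
for my $bucket (sort { $a <=> $b } keys %hist) {
    print "$bucket: $hist{$bucket}\n";
}
```

With the data structure from the original script, the list of stage times could presumably be built with `map { $_->{stagetime} } @diskxStats` and passed in the same way.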
Re: Working out frequency statistics with perl by pc88mxer (Vicar) on Jul 15, 2008 at 14:24 UTC
In this code, I would not blindly assume the last character is a period:

chop $fields[9]; # Remove the period after the last digit

It's better to use a regex to remove it:

$fields[9] =~ s{\.$}{}; # Remove any trailing period

Then again, this shouldn't be necessary since you are using `$fields[9]` as a number. For instance, the strings `"3.14"` and `"3.14."` will evaluate to the same number.
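To illustrate the difference between the two approaches, here is a small sketch with made-up field values. The helper names `strip_with_chop` and `strip_with_regex` are hypothetical and only wrap the two statements quoted above.

```perl
use strict;
use warnings;

# Removes the last character unconditionally, as chop does.
sub strip_with_chop {
    my ($value) = @_;
    chop $value;
    return $value;
}

# Removes a trailing period only if one is present.
sub strip_with_regex {
    my ($value) = @_;
    $value =~ s{\.$}{};
    return $value;
}

# Made-up field values: one with a trailing period, one without.
for my $raw ('42.', '42') {
    printf "raw='%s' chop='%s' regex='%s'\n",
        $raw, strip_with_chop($raw), strip_with_regex($raw);
}
```

If a field ever arrives without the trailing period, `chop` silently drops the last digit, which would move that value into a much lower bin.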
Re: Working out frequency statistics with perl by hilitai (Monk) on Jul 15, 2008 at 15:22 UTC
You could try using the Statistics::Descriptive module to do the work for you.
Re: Working out frequency statistics with perl by jethro (Monsignor) on Jul 15, 2008 at 14:50 UTC
Couldn't see anything wrong, but that won't help you a bit. Possibly you could tell us in what way these numbers look wrong without telling too much. Did you construct an example log file with just three or four entries and check by hand whether the output is right? This is the first thing I would do.
Re: Working out frequency statistics with perl by graff (Chancellor) on Jul 16, 2008 at 01:10 UTC
You aren't showing the whole script, so I wonder... do you have these lines anywhere in your code (near the top)?

use strict;
use warnings;

Apart from that, I wonder if your "readmap" command ever outputs any lines with initial whitespace. If so, your split statement should be:

@fields = split " ";

because that will differ from `/\s+/` -- try this snippet to see the difference:

$_ = " begins with whitespace";
@a = split " ";
@b = split /\s+/;
print "quoted-space split returns ", scalar @a, " elements\n";
print "\\s+ split returns ", scalar @b, ", first one has length ", length($b[0]), "\n";

update: To clarify the issue, using "\s+" for splitting will get you into real trouble if the "readmap" output varies in terms of presence/absence of whitespace at the beginning of each line. If the output is consistent in this regard, then using "\s+" is probably not a problem -- you just need to make sure you've counted the field indexes correctly, so that `$fields[2]` etc. really point at what you want them to point at.
Re^2: Working out frequency statistics with perl by wishartz (Beadle) on Jul 16, 2008 at 10:43 UTC
Hello again Monks, thanks for everybody's feedback. To answer the last reply I got: I am using warnings and strict, and I am not getting any errors. The output is always consistent as well. Here is a sample of the exact output from the readmap command where it matches the pattern StageTime.

(sample output, 22 kB, not shown)

I don't understand the subroutine that was posted earlier. I don't understand what I am supposed to pass to items? Am I supposed to pass the sorted array, that is, only the keys and not the values?

sub build_histogram {
    my ($bucket_size, @items) = @_;
    my %result;
    for (@items) {
        my $bucket = $bucket_size * floor($_ / $bucket_size);
        print "$bucket\n";
        $result{$bucket}++;
    }
    return %result;
}
Re^3: Working out frequency statistics with perl by moritz (Cardinal) on Jul 16, 2008 at 11:10 UTC
"I don't understand the subroutine that was posted earlier. I don't understand what I am supposed to pass to items?"

A list of numbers from which you want to build your histogram. And it doesn't have to be sorted. Your sample data generates this histogram for me:

0:       67
30:      73
60:      93
90:      75
120:     26
150:     18
180:     10
210:     5
240:     2
270:     1
300:     1
330:     2
360:     1
390:     1
450:     1
840:     1
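The histogram above contains a bucket labeled 0 that the output in the original question lacks. One possible explanation, sketched below under stated assumptions, is that the original loop starts its first bin at the bin size, so values smaller than the bin size never match any bin. The helper names (`bins_by_loop`, `bins_by_floor`, `total`) and the sample times are made up for illustration; this is not a diagnosis confirmed by anyone in the thread.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Binning as in the original script: the first bin starts at $bin,
# so any value below $bin matches no bin and is silently dropped.
sub bins_by_loop {
    my ($bin, $max, @values) = @_;
    my %count;
    for (my $low = $bin; $low <= $max; $low += $bin) {
        my $high = $low + $bin;
        for my $v (@values) {
            $count{$low}++ if $v >= $low && $v < $high;
        }
    }
    return %count;
}

# Floor-based binning: every value lands in exactly one bucket.
sub bins_by_floor {
    my ($bin, @values) = @_;
    my %count;
    $count{ $bin * floor($_ / $bin) }++ for @values;
    return %count;
}

# Total number of values counted across all bins.
sub total {
    my (%count) = @_;
    my $n = 0;
    $n += $_ for values %count;
    return $n;
}

my @times = (3, 12, 29, 30, 45, 61);    # made-up stage times in seconds

printf "loop:  %d of %d values counted\n",
    total(bins_by_loop(30, 3600, @times)), scalar @times;
printf "floor: %d of %d values counted\n",
    total(bins_by_floor(30, @times)), scalar @times;
```

Comparing the total of the counts against the number of input values is a quick check that no entries fall outside the bins.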
Re^4: Working out frequency statistics with perl by wishartz (Beadle) on Jul 16, 2008 at 12:48 UTC
Re^5: Working out frequency statistics with perl by jethro (Monsignor) on Jul 16, 2008 at 14:26 UTC