wishartz has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I am trying to plot a frequency chart from a log file that records how long it takes for a file to be brought back from tape and how large that file is. For example, the bin size for the number of seconds it takes to bring back a file from tape is 30 secs, going up in increments of 30 secs, and the output looks like this:

seconds  No of files
30: 72
60: 93
90: 75
120: 26
150: 18
180: 10
210: 5
240: 2
270: 1
300: 1
330: 2
360: 1
390: 1
450: 1
840: 1

And for the size of files, it goes up in increments of 25000000 kilobytes:

Size of file  No of files
25000000: 48
50000000: 59
75000000: 30
100000000: 21
125000000: 69
150000000: 13
175000000: 4
200000000: 3
250000000: 1
350000000: 6
425000000: 13
450000000: 3
550000000: 2
675000000: 1
1075000000: 1

The problem is that when I check the values it gives me, they seem incorrect. I cannot really supply the log file, but I wanted to know whether, judging by the code, this is the correct way to work out the frequency.

$bin_stage = 30;           #bin size in number of seconds
$bin_filesize = 25000000;  #bin size in kb

open(READMAP, "$command |") || error_exit("Cannot run readmap, $!"); #run the readmap command

while(<READMAP>) { #loop through the output of the readmap command
    if (/StageTime/){
        @fields = split/\s+/;  #Split the output by white space
        chop $fields[9];       # Remove the period after the last digit
        $diskxStats[$i]{'filesize'}=$fields[2];
        $diskxStats[$i]{'ftptime'}=$fields[5];
        $diskxStats[$i]{'stagetime'}=$fields[9];
        if ( $fields[5] != 0 ){
            $diskxStats[$i]{'transferRate'} = $fields[2] / $fields[5];
        }
        else{
            $diskxStats[$i]{'transferRate'} = $fields[2];
        }
        if ( $fields[9] != 0 ){
            $diskxStats[$i]{'stagerate'} = $fields[2] / $fields[9];
        }
        else{
            $diskxStats[$i]{'stagerate'} = $fields[2];
        }
        $i++;
    }
}
my $j;

## Build the hash for frequency of stagetime ########################
my $countstage = 0;
for ( $countstage = $bin_stage; $countstage <= 3600; $countstage+=$bin_stage){
    $nextbin = $countstage + $bin_stage;
    for $i ( 0 .. $#diskxStats ) {
        if ( $diskxStats[$i]{'stagetime'} >=$countstage && $diskxStats[$i]{'stagetime'} < $nextbin){
            $frequency_stage{$countstage}{'stage_counter'}++;
        }
    }
}

## Build the hash for frequency of filesizes ########################
my $file_counter= 0;
my $countfile = 0;
$nextbin = 0;
my $array_index = 0;
for ( $countfile = $bin_filesize; $countfile <= 1125000000; $countfile+=$bin_filesize){
    $nextbin = $countfile + $bin_filesize;
    for $array_index ( 0 .. $#diskxStats ) {
        if ( $diskxStats[$array_index]{'filesize'} >=$countfile && $diskxStats[$array_index]{'filesize'} < $nextbin){
            $frequency_files{$countfile}{'filesize'}++;
        }
    }
}

my $stage_counter=0;
my @sorted_stage = sort { $frequency_stage{$a} cmp $frequency_stage{$b} } keys %frequency_stage;
foreach $i (@sorted_stage) {
    print "$i: ";
    foreach $stage_counter ( keys %{ $frequency_stage{$i} } ) {
        print "$frequency_stage{$i}{$stage_counter}\n";
    }
}

my @sorted_filesizes = sort { $frequency_files{$a} cmp $frequency_files{$b} } keys %frequency_files;
foreach $i (@sorted_filesizes) {
    print "$i: ";
    foreach $file_counter ( keys %{ $frequency_files{$i} } ) {
        print "$frequency_files{$i}{$file_counter}\n";
    }
}

Thanks

Replies are listed 'Best First'.
Re: Working out frequency statistics with perl
by moritz (Cardinal) on Jul 15, 2008 at 14:27 UTC
    I don't know if your code is correct or not, but it's certainly more complex than it needs to be.

    Here's how I'd build the frequency statistics:

    use POSIX qw(floor);

    sub build_histogram {
        my ($bucket_size, @items) = @_;
        my %result;
        for (@items) {
            my $bucket = $bucket_size * floor($_ / $bucket_size);
            $result{$bucket}++;
        }
        return %result;
    }

    This sorts every value from 0 up to (but excluding) $bucket_size into the first bucket, labeled 0, and so on. If you want those values to go into a bucket labeled $bucket_size instead, use ceil() instead of floor().
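To make the floor()/ceil() difference concrete, here is a small sketch. The `bucket_label` helper and the sample values are made up for illustration and do not come from the log in the question.

```perl
use strict;
use warnings;
use POSIX qw(floor ceil);

# Hypothetical helper: label a value by the lower edge (floor) or
# the upper edge (ceil) of its bucket.
sub bucket_label {
    my ($bucket_size, $value, $round_up) = @_;
    my $ratio = $value / $bucket_size;
    return $bucket_size * ($round_up ? ceil($ratio) : floor($ratio));
}

for my $seconds (0, 12, 29, 30, 31, 59) {    # made-up stage times
    printf "%2d -> floor label %2d, ceil label %2d\n",
        $seconds,
        bucket_label(30, $seconds, 0),
        bucket_label(30, $seconds, 1);
}
```

Note that the bucket edges shift as well: a value of exactly 30 is labeled 30 either way, but 31 is labeled 30 by floor() and 60 by ceil(), and a value of exactly 0 stays in bucket 0 in both cases.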

    Anyway, it's hard to tell if your code is correct or not. Try to use Data::Dumper to check if the intermediate data structures are correct, and perhaps even do it manually for a short excerpt of the log files and compare the results.
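As a rough sketch of the Data::Dumper check, assuming records shaped like the @diskxStats entries in the question (the values below are invented):

```perl
use strict;
use warnings;
use Data::Dumper;

# Made-up records in the same shape as @diskxStats in the question.
my @diskxStats = (
    { filesize => 1200,     ftptime => 4,  stagetime => 12 },
    { filesize => 52000000, ftptime => 60, stagetime => 45 },
);

$Data::Dumper::Sortkeys = 1;    # stable key order makes dumps easier to compare
my $dump = Dumper(\@diskxStats);
print $dump;
```

Dumping the array right after the parsing loop shows whether each field landed in the key you expected before any binning happens.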

Re: Working out frequency statistics with perl
by pc88mxer (Vicar) on Jul 15, 2008 at 14:24 UTC
    In this code, I would not blindly assume the last character is a period:
    chop $fields[9]; # Remove the period after the last digit
    It's better to use a regex to remove it:
    $fields[9] =~ s{\.$}{}; # Remove any trailing period
    Then again, this shouldn't be necessary since you are using $fields[9] as a number. For instance, the strings "3.14" and "3.14." will evaluate to the same number.
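A minimal sketch of the difference, using invented field values and a hypothetical `strip_trailing_period` helper:

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing period, if present.
sub strip_trailing_period {
    my ($text) = @_;
    $text =~ s{\.$}{};
    return $text;
}

for my $raw ("45.", "45") {    # made-up field values
    my $chopped = $raw;
    chop $chopped;             # always removes the last character
    printf "%-3s chop: %-2s regex: %s\n",
        $raw, $chopped, strip_trailing_period($raw);
}
```

When the period is missing, chop turns 45 into 4, while the regex leaves it alone. Under `use warnings`, using a string such as "3.14." as a number still yields 3.14 but emits an "isn't numeric" warning, which is one more reason to strip the period explicitly.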
Re: Working out frequency statistics with perl
by hilitai (Monk) on Jul 15, 2008 at 15:22 UTC
Re: Working out frequency statistics with perl
by jethro (Monsignor) on Jul 15, 2008 at 14:50 UTC
    Couldn't see anything wrong, but that won't help you a bit.

    Possibly you could tell us in what way these numbers look wrong without telling too much.

    Did you construct an example log file with just three or four entries and check by hand whether the output is right? This is the first thing I would do.

Re: Working out frequency statistics with perl
by graff (Chancellor) on Jul 16, 2008 at 01:10 UTC
    You aren't showing the whole script, so I wonder... do you have these lines anywhere in your code (near the top)?
    use strict;
    use warnings;
    Apart from that, I wonder if your "readmap" command ever outputs any lines with initial whitespace. If so, your split statement should be:
    @fields = split " ";
    because that will differ from /\s+/ -- try this snippet to see the difference:
    $_ = " begins with whitespace";
    @a = split " ";
    @b = split /\s+/;
    print "quoted-space split returns ", scalar @a, " elements\n";
    print "\\s+ split returns ", scalar @b, ", first one has length ",length($b[0]),"\n";
    update: To clarify the issue, using "\s+" for splitting will get you into real trouble if the "readmap" output varies in terms of presence/absence of whitespace at the beginning of each line. If the output is consistent in this regard, then using "\s+" is probably not a problem -- you just need to make sure you've counted the field indexes correctly, so that $fields[2] etc really point at what you want them to point at.
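To illustrate how the field indexes shift, here is a sketch on an invented line. The real readmap output format is not shown in this thread, so the line below is purely an assumption.

```perl
use strict;
use warnings;

# Made-up line with leading whitespace; not real readmap output.
my $line = "  name size 1200 time 4";

my @by_space = split ' ', $line;      # leading whitespace is skipped
my @by_regex = split /\s+/, $line;    # leading whitespace yields an empty first field

print "split ' '   : field 2 is '$by_space[2]'\n";
print "split /\\s+/ : field 2 is '$by_regex[2]'\n";
```

With the pattern form, every index is off by one for lines that start with whitespace, so $fields[2] would silently pick up a different column.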
      Hello again Monks, thanks for everybody's feedback. To answer the last reply I got: I am using warnings and strict, and I am not getting any errors. The output is always consistent as well. Here is a sample of the exact output from the readmap command when a line matches the pattern StageTime.

      I don't understand the subroutine that was posted earlier. I don't understand what I am supposed to pass to items? Am I supposed to pass the sorted array, that is, only the keys and not the values?
      sub build_histogram {
          my ($bucket_size, @items) = @_;
          my %result;
          for (@items) {
              my $bucket = $bucket_size * floor($_ / $bucket_size);
              print "$bucket\n";
              $result{$bucket}++;
          }
          return %result;
      }
        I don't understand the subroutine that was posted earlier. I don't understand what I am supposed to pass to items?

        A list of numbers from which you want to build your histogram. And it doesn't have to be sorted.
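For instance, one way to pull such a list out of an array of records like @diskxStats is a `map` over the relevant key. This is only a sketch with invented records; the hash names `%stage_hist` and `%size_hist` are made up here.

```perl
use strict;
use warnings;
use POSIX qw(floor);

sub build_histogram {
    my ($bucket_size, @items) = @_;
    my %result;
    $result{ $bucket_size * floor($_ / $bucket_size) }++ for @items;
    return %result;
}

# Made-up records in the shape of @diskxStats from the question.
my @diskxStats = (
    { stagetime => 12, filesize => 1_000_000 },
    { stagetime => 31, filesize => 26_000_000 },
    { stagetime => 59, filesize => 30_000_000 },
);

# Pass the raw numbers, unsorted, one call per quantity.
my %stage_hist = build_histogram(30,         map { $_->{stagetime} } @diskxStats);
my %size_hist  = build_histogram(25_000_000, map { $_->{filesize} }  @diskxStats);

print "$_: $stage_hist{$_}\n" for sort { $a <=> $b } keys %stage_hist;
```

The same subroutine serves both histograms; only the bucket size and the extracted key change.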

        Your sample data generates this histogram for me:

        0: 67
        30: 73
        60: 93
        90: 75
        120: 26
        150: 18
        180: 10
        210: 5
        240: 2
        270: 1
        300: 1
        330: 2
        360: 1
        390: 1
        450: 1
        840: 1
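One visible difference from the output in the question is the 0 bucket. The loop in the question starts its first bin at $bin_stage, so values below 30 are never counted there. The sketch below compares the two approaches on invented stage times; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(floor);

my @stagetimes = (5, 12, 29, 30, 45, 61);    # made-up values
my $bin = 30;

# Range loop in the style of the question: the first bin starts at $bin.
my %loop_hist;
for (my $low = $bin; $low <= 3600; $low += $bin) {
    my $count = grep { $_ >= $low && $_ < $low + $bin } @stagetimes;
    $loop_hist{$low} = $count if $count;
}

# floor()-based bucketing: the first bin starts at 0.
my %floor_hist;
$floor_hist{ $bin * floor($_ / $bin) }++ for @stagetimes;

for my $pair ([loop => \%loop_hist], [floor => \%floor_hist]) {
    my ($name, $hist) = @$pair;
    print "$name: ",
        join(", ", map { "$_: $hist->{$_}" } sort { $a <=> $b } keys %$hist),
        "\n";
}
```

The keys are sorted numerically with `<=>` so that 120 comes after 90 rather than before 30.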