Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm trying to perform calculations on numbers which serve as the values for two different (but related) hashes.

Essentially, I have created a hash of hashes which allows me to print out the daily frequency of all of the words in my large corpus. This script is functional (although it does need optimizing/tidying-up a little, which I intend to do next!); however, at the moment, it only returns the raw/actual frequency of words in my corpus per day.

What I want to do instead is find the normalised/relative frequency of each word per day. So, essentially, for each word and each day, I need to divide the current frequency returned by the above script, by the total number of words used on that day, and then multiply this number by 100.

My current script is as follows (illustrated below the cut using an SSCCE):

use strict;
use warnings;
use Data::Dumper;

my %mycorpus = (
    text1 => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time> Irrelevant text that I do not need. ##a## gt##a b c## ##a b a c d A b## <97> 164 notes",
    text2 => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time> Irrelevant text that may feature the word soft, softest, or softly. ##a## ##a b a ## <97> 379 notes Irrelevant text.",
    text3 => "<p><time datetime=2017-09-30T19:27:03Z>30/09/17 19:27:03</time> ##C## ##A## ##b## <97> 180 notes Irrelevant text."
);

my %counts;
my %overallcounts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $hashtags = '';
    #find the dates
    if ($mycorpus{$filename} =~ /(\d{4}-\d{2}-\d{2})T/g){
        $date = $1;
    }
    #find only the relevant text
    while ($mycorpus{$filename} =~ /##(.*)##/g){
        $hashtags = $1;
        #split text into words
        my @words = split /\W+/, $hashtags;
        foreach my $word (@words){
            if ($word =~ /(\w+)/gi){
                $word =~ tr/A-Z/a-z/;
                $counts{$date}{$word}++;
                $overallcounts{$date}++; #new hash to help with relative frequency overall count per day
            }
        }
    }
}
print Dumper \%counts;

The current output is as follows:

$VAR1 = {
          '2017-09-04' => {
                            'b' => 3,
                            'd' => 1,
                            'c' => 2,
                            'a' => 5
                          },
          '2017-09-30' => {
                            'b' => 2,
                            'c' => 1,
                            'a' => 4
                          }
        };

However, my expected output would be something like this (I've rounded to 2 d.p, but this is not necessary for the final results).

$VAR1 = {
          '2017-09-04' => {
                            'b' => 27.27,
                            'd' => 9.09,
                            'c' => 18.18,
                            'a' => 45.45
                          },
          '2017-09-30' => {
                            'b' => 28.57,
                            'c' => 14.29,
                            'a' => 57.14
                          }
        };

However, I am really struggling to work out how to do this. I can't picture how this script would look and I've been unable to find any relevant examples online.

I think that I might need to utilise a second hash, where the dates are the keys and the total number of words per day are the values (and I've done this in the example above). I would then need to divide each of the values in my original hash of hashes by the corresponding values in this new hash, and then multiply this new number by 100. The returned value would then need to be stored in my hash of hashes.

However, while I know the steps that I need to take in theory, I have really reached my current 'Perl-knowledge' limit, and I would really appreciate any advice on how to implement this final part of my code.

Replies are listed 'Best First'.
Re: Calculations using values from two different hashes
by choroba (Cardinal) on Nov 25, 2017 at 10:33 UTC
    If you need to sum the values of each subhash, List::Util can help you with that. Then just divide each number by the sum and multiply by 100.
    use List::Util qw{ sum };
    for my $report (values %counts) {
        my $sum = sum(values %$report);
        $_ = sprintf '%.2f', 100 * $_ / $sum for values %$report;
    }

    So, no need for %overallcounts.
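
    The suggestion above can be tried in isolation. The following is only a minimal sketch: it assumes %counts already holds the raw counts shown in the question's "current output" (hard-coded here as sample data) and relies on values returning aliases to the hash values inside a for loop, so assigning to $_ overwrites each count in place.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sample data copied from the "current output" in the question.
my %counts = (
    '2017-09-04' => { a => 5, b => 3, c => 2, d => 1 },
    '2017-09-30' => { a => 4, b => 2, c => 1 },
);

for my $report (values %counts) {
    # Total number of words counted on this day.
    my $sum = sum(values %$report);

    # Each $_ is an alias to a hash value, so this replaces the raw
    # count with its percentage of the daily total.
    $_ = sprintf '%.2f', 100 * $_ / $sum for values %$report;
}

# Print the result in a stable order.
for my $date (sort keys %counts) {
    for my $word (sort keys %{ $counts{$date} }) {
        print "$date $word $counts{$date}{$word}\n";
    }
}
```

    With this sample data the percentages match the expected output in the question (for example, 'a' on 2017-09-04 becomes 45.45). Note that sprintf returns strings, which Perl converts back to numbers wherever they are used numerically.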

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thanks for this! I had stumbled upon List::Util in my Googling, but I couldn't work out how to actually incorporate it into my code. Will do some experimenting with this now. Thanks again!
Re: Calculations using values from two different hashes
by FreeBeerReekingMonk (Deacon) on Nov 25, 2017 at 10:40 UTC
    Append this after all the calculations:

    foreach my $date (keys %counts){
        my $overallcount = $overallcounts{$date};
        foreach my $word (keys %{$counts{$date}}){
            $counts{$date}{$word."_average"} = sprintf("%.2f", 100 * $counts{$date}{$word} / $overallcount);
        }
        $counts{$date}{"_overallcount"} = $overallcount;
    }
    print Dumper \%counts;

    Which yields:

    $VAR1 = {
              '2017-09-30' => {
                                'c' => 1,
                                'b' => 2,
                                'a_average' => '57.14',
                                '_overallcount' => 7,
                                'b_average' => '28.57',
                                'c_average' => '14.29',
                                'a' => 4
                              },
              '2017-09-04' => {
                                'd' => 1,
                                'c_average' => '18.18',
                                '_overallcount' => 11,
                                'b' => 3,
                                'a' => 5,
                                'b_average' => '27.27',
                                'd_average' => '9.09',
                                'c' => 2,
                                'a_average' => '45.45'
                              }
            };

    You can then replace {$word."_average"} with {$word} to overwrite the original values.
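
    For illustration, the overwriting variant might look like the sketch below. It is not the complete script: %counts and %overallcounts are hard-coded with the sample figures from this thread, and the "_overallcount" key is left out on the assumption that only per-word percentages are wanted in the final hash.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sample data: raw counts from the question, plus the per-day totals
# that the original script accumulates in %overallcounts.
my %counts = (
    '2017-09-04' => { a => 5, b => 3, c => 2, d => 1 },
    '2017-09-30' => { a => 4, b => 2, c => 1 },
);
my %overallcounts = (
    '2017-09-04' => 11,
    '2017-09-30' => 7,
);

foreach my $date (keys %counts) {
    my $overallcount = $overallcounts{$date};
    foreach my $word (keys %{ $counts{$date} }) {
        # Assign back to the same key, replacing the raw count.
        $counts{$date}{$word} =
            sprintf("%.2f", 100 * $counts{$date}{$word} / $overallcount);
    }
}

# Sort hash keys so the dump is printed in a stable order.
$Data::Dumper::Sortkeys = 1;
print Dumper \%counts;
```

    Because no keys are added or deleted inside the loops, it is safe to change the values while iterating over keys.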

      This works brilliantly, thank you very much! Just to double check that I've understood everything here, does the "%.2f" part of the code specify that only the first two decimal places should be returned?
        does the "%.2f" part of the code specify that only the first two decimal places should be returned?

        Yes.
        More strictly speaking the returned value will be the actual value, rounded to two decimal places in accordance with the rule "round to nearest, ties to even".
        C:\>perl -le "printf '%.2f', 0.245;"
        0.24
        C:\>perl -le "printf '%.2f', 0.255;"
        0.26
        Cheers,
        Rob
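
        One way to see where those two results come from is to print the stored values with more digits. This is only a rough sketch: neither 0.245 nor 0.255 can be represented exactly as a binary floating-point number, so sprintf rounds the nearest stored double rather than the decimal literal, and the exact digits that %.20f prints may vary with the platform's C library.

```perl
use strict;
use warnings;

my %rounded;
for my $x (0.245, 0.255) {
    # %.20f exposes the double that is actually stored;
    # %.2f rounds that stored value, not the decimal literal.
    printf "%s is stored as %.20f and prints as %.2f\n", $x, $x, $x;
    $rounded{$x} = sprintf '%.2f', $x;
}
```

        The stored value for 0.245 lies slightly below the literal and the one for 0.255 slightly above it, which is consistent with the 0.24 and 0.26 shown above.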
        Hello Marie oops.. Maire,

        > Just to double check that I've understood everything here, does the "%.2f"..

        This is covered under sprintf documentation, specifically in the paragraph "precision, or maximum width".

        Anyway try it to see:

        # pay attention to MSWin32 double quotes!
        perl -e "printf qq($_\n),3.141592 for '%f','%.0f','%.1f','%.2f'"
        3.141592
        3
        3.1
        3.14

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.