
Hi Monks,

I'm trying to perform calculations on numbers which serve as the values for two different (but related) hashes.

Essentially, I have created a hash of hashes which allows me to print out the daily frequency of every word in my large corpus. This script is functional (although it does need optimizing/tidying-up a little, which I intend to do next!); however, at the moment, it only returns the raw/actual frequency of words in my corpus per day.

What I want to do instead is find the normalised/relative frequency of each word per day. So, for each word and each day, I need to divide the raw frequency returned by my script by the total number of words used on that day, and then multiply the result by 100.

My current script is as follows (illustrated below the cut using an SSCCE):

use strict;
use warnings;
use Data::Dumper;

my %mycorpus = (
          text1 => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time>
                    Irrelevant text that I do not need.
##a##
gt##a b c##
##a b a c d A b##
 <97> 164 notes",

            text2 => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time>
            Irrelevant text that may feature the word soft, softest, or softly.
##a##
##a b a ##

 <97> 379 notes
            Irrelevant text.",

            text3 => "<p><time datetime=2017-09-30T19:27:03Z>30/09/17 19:27:03</time>
##C##
##A##
##b##
 <97> 180 notes
 Irrelevant text."
);

my %counts;
my %overallcounts;

foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $hashtags = '';

        #find the dates
        if ($mycorpus{$filename} =~ /(\d{4}-\d{2}-\d{2})T/){
            $date = $1;
        }
        
        #find only the relevant text
        while ($mycorpus{$filename} =~ /##(.*)##/g){
            $hashtags = $1;
            
        #split text into words
            my @words = split /\W+/, $hashtags;
            
            foreach my $word (@words){
                
                if ($word =~ /\w+/){
                    $word =~ tr/A-Z/a-z/;
                    $counts{$date}{$word}++;
                    $overallcounts{$date}++; #new hash to help with relative frequency overall count per day
                }
            }           
        }        
}

print Dumper \%counts;

The current output is as follows:

$VAR1 = {
          '2017-09-04' => {
                            'b' => 3,
                            'd' => 1,
                            'c' => 2,
                            'a' => 5
                          },
          '2017-09-30' => {
                            'b' => 2,
                            'c' => 1,
                            'a' => 4
                          }
        };

My expected output would be something like this (I've rounded to 2 d.p., but this is not necessary for the final results):

$VAR1 = {
          '2017-09-04' => {
                            'b' => 27.27,
                            'd' => 9.09,
                            'c' => 18.18,
                            'a' => 45.45
                          },
          '2017-09-30' => {
                            'b' => 28.57,
                            'c' => 14.29,
                            'a' => 57.14
                          }
        };

I am really struggling to work out how to do this, though. I can't picture how this script would look and I've been unable to find any relevant examples online.

I think that I might need to utilise a second hash, where the dates are the keys and the total number of words per day are the values (and I've done this in the example above). I would then need to divide each of the values in my original hash of hashes by the corresponding values in this new hash, and then multiply this new number by 100. The returned value would then need to be stored in my hash of hashes.
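The steps in the previous paragraph could be sketched roughly as follows. This is only a minimal illustration, not a tested part of the script: it assumes %counts and %overallcounts have already been filled in as in the code above (here they are hard-coded with the raw counts from the current output), and %relative is a hypothetical name for a separate hash that holds the percentages so the raw counts are not overwritten.

```perl
use strict;
use warnings;

# Raw counts per day, copied from the current output above.
my %counts = (
    '2017-09-04' => { a => 5, b => 3, c => 2, d => 1 },
    '2017-09-30' => { a => 4, b => 2, c => 1 },
);

# Total number of words per day (what %overallcounts holds in the script).
my %overallcounts = (
    '2017-09-04' => 11,
    '2017-09-30' => 7,
);

# %relative is a hypothetical name for the normalised results.
my %relative;
for my $date (sort keys %counts) {
    my $total = $overallcounts{$date};
    next unless $total;    # skip days with no words, avoiding division by zero

    for my $word (sort keys %{ $counts{$date} }) {
        # relative frequency = raw count / total words that day * 100
        $relative{$date}{$word} = 100 * $counts{$date}{$word} / $total;
    }
}

# Print one line per date and word, rounded to 2 d.p. for display only.
for my $date (sort keys %relative) {
    for my $word (sort keys %{ $relative{$date} }) {
        printf "%s %s %.2f\n", $date, $word, $relative{$date}{$word};
    }
}
```

Writing the results into a separate hash keeps the raw counts available for later use; assigning back into $counts{$date}{$word} would also work if the raw values are no longer needed.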

However, while I know the steps that I need to take in theory, I have really reached my current 'Perl-knowledge' limit, and I would really appreciate any advice on how to implement this final part of my code.

In reply to Calculations using values from two different hashes by Maire
