Hi Monks,
I'm trying to perform calculations on numbers which serve as the values for two different (but related) hashes.
Essentially, I have created a hash of hashes which allows me to print out the daily frequency of every word in my large corpus. This script is functional (although it does need optimizing/tidying-up a little, which I intend to do next!); however, at the moment, it only returns the raw/actual frequency of words in my corpus per day.
What I want to do instead is find the normalised/relative frequency of each word per day. So, essentially, for each word and each day, I need to divide the raw frequency returned by the script below by the total number of words used on that day, and then multiply the result by 100.
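To make the arithmetic concrete, here is a minimal sketch of that calculation on its own. The subroutine name `relative_frequency` is hypothetical and is not part of my script; the numbers come from the sample data further down, where 'a' accounts for 5 of the 11 words counted on 2017-09-04.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration only):
# percentage of a day's words accounted for by a single word.
sub relative_frequency {
    my ($count, $total) = @_;
    return 0 unless $total;          # avoid dividing by zero on empty days
    return $count / $total * 100;
}

# 'a' occurs 5 times among the 11 words counted for 2017-09-04
printf "%.2f\n", relative_frequency(5, 11);    # prints 45.45
```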
My current script is as follows (illustrated below the cut using an SSCCE):
    use strict;
    use warnings;
    use Data::Dumper;

    my %mycorpus = (
        text1 => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time> Irrelevant text that I do not need. ##a## >##a b c## ##a b a c d A b## <97> 164 notes",
        text2 => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time> Irrelevant text that may feature the word soft, softest, or softly. ##a## ##a b a ## <97> 379 notes Irrelevant text.",
        text3 => "<p><time datetime=2017-09-30T19:27:03Z>30/09/17 19:27:03</time> ##C## ##A## ##b## <97> 180 notes Irrelevant text."
    );

    my %counts;
    my %overallcounts;

    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $hashtags = '';

        #find the dates
        if ($mycorpus{$filename} =~ /(\d{4}-\d{2}-\d{2})T/g){
            $date = $1;
        }

        #find only the relevant text
        while ($mycorpus{$filename} =~ /##(.*)##/g){
            $hashtags = $1;

            #split text into words
            my @words = split /\W+/, $hashtags;
            foreach my $word (@words){
                if ($word =~ /(\w+)/gi){
                    $word =~ tr/A-Z/a-z/;
                    $counts{$date}{$word}++;
                    $overallcounts{$date}++; #new hash to help with relative frequency overall count per day
                }
            }
        }
    }

    print Dumper \%counts;
The current output is as follows:
    $VAR1 = {
        '2017-09-04' => {
            'b' => 3,
            'd' => 1,
            'c' => 2,
            'a' => 5
        },
        '2017-09-30' => {
            'b' => 2,
            'c' => 1,
            'a' => 4
        }
    };
However, my expected output would be something like this (I've rounded to 2 d.p., but this is not necessary for the final results):
    $VAR1 = {
        '2017-09-04' => {
            'b' => 27.27,
            'd' => 9.09,
            'c' => 18.18,
            'a' => 45.45
        },
        '2017-09-30' => {
            'b' => 28.57,
            'c' => 14.29,
            'a' => 57.14
        }
    };
However, I am really struggling to work out how to do this. I can't picture how this script would look and I've been unable to find any relevant examples online.
I think that I might need to utilise a second hash, where the dates are the keys and the total number of words per day are the values (and I've done this in the example above). I would then need to divide each of the values in my original hash of hashes by the corresponding values in this new hash, and then multiply this new number by 100. The returned value would then need to be stored in my hash of hashes.
However, while I know the steps that I need to take in theory, I have really reached my current 'Perl-knowledge' limit, and I would really appreciate any advice on how to implement this final part of my code.
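One possible way to carry out the steps described above is sketched here, not as a definitive solution. It assumes `%counts` and `%overallcounts` have already been filled as in the script; to keep the example self-contained they are hard-coded with the sample values from the current output. The `%relative` hash is a hypothetical name introduced for illustration, so that the raw counts are kept rather than overwritten.

```perl
use strict;
use warnings;
use Data::Dumper;

# Sample values copied from the current output; in the real script
# these two hashes are populated by the parsing loop.
my %counts = (
    '2017-09-04' => { a => 5, b => 3, c => 2, d => 1 },
    '2017-09-30' => { a => 4, b => 2, c => 1 },
);
my %overallcounts = ( '2017-09-04' => 11, '2017-09-30' => 7 );

# %relative is a hypothetical name (an assumption, not from the script).
my %relative;
foreach my $date (keys %counts) {
    # Skip any day with no counted words to avoid dividing by zero.
    my $total = $overallcounts{$date} or next;

    foreach my $word (keys %{ $counts{$date} }) {
        $relative{$date}{$word} = 100 * $counts{$date}{$word} / $total;
    }
}

$Data::Dumper::Sortkeys = 1;    # stable key order when dumping
print Dumper \%relative;
```

The values are stored unrounded; something like `sprintf "%.2f", $relative{$date}{$word}` could be applied at print time if two decimal places are wanted. Assigning back into `$counts{$date}{$word}` instead of `%relative` would store the percentages in the original hash of hashes, at the cost of losing the raw counts.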
In reply to Calculations using values from two different hashes by Maire