"Any guidance would be very much appreciated."

I'm going to assume that your "corpus" is the same, or similar, data to what we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":

I said:

"... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."

You replied:

"... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."

In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
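To make the difference concrete, here's a minimal sketch. The names get_corpus_hash and get_corpus_ref, and the two-entry %corpus, are placeholders of mine and not your code; I'm assuming your values hold the file contents.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; the real corpus will be far larger.
my %corpus = (fileA => 'text A', fileB => 'text B');

# Returns a flat list of keys and values: every pair is copied for the caller.
sub get_corpus_hash { return %corpus }

# Returns a single scalar (a reference): no per-pair copying.
sub get_corpus_ref { return \%corpus }

my %copy = get_corpus_hash();
my $ref  = get_corpus_ref();

# Calling code has to change from $copy{...} to $ref->{...} (or %$ref).
my $from_copy = $copy{fileA};
my $from_ref  = $ref->{fileA};
my $n_files   = keys %$ref;    # keys in scalar context gives the count

print "$from_copy, $from_ref, $n_files files\n";
```

The change inside the sub is tiny; it's the dereferencing in every caller that adds up, which is why I'd switch before more code depends on the hash form.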

Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:

- You have "foreach my $filename (sort keys %mycorpus) { ... }".
  - Your code indicates you have no interest in the order of the keys: remove sort.
  - Your code indicates you have no interest in the keys themselves. You only use them to access the values ($mycorpus{$filename}) in a couple of places. Skip over all that unnecessary processing by changing keys to values: now you can iterate just the data you want to work with.
- You have if conditions using regexes with the 'g' modifier. In this particular context, that modifier is pointless: remove it. See "perlre: Modifiers" for more about that.
- You have made liberal use of lookaround assertions (see "perlre: Lookaround Assertions"). These are all extra work for the regex engine and none are actually required here. See the regexes in my example script below, which don't use these assertions.
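Here's a rough sketch of the first two points. The two-entry %mycorpus and the /soft/ pattern are stand-ins of mine, not your real data. The second half shows that 'g' in a yes/no test can be worse than pointless: in scalar context a successful match leaves pos() set, so a later 'g' match on the same string carries on from there.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real corpus.
my %mycorpus = (
    fileA => 'a soft pillow',
    fileB => 'a hard chair',
);

# Iterate the values directly: no sort, no key lookups.
my $files_with_soft = 0;
for my $text (values %mycorpus) {
    $files_with_soft++ if $text =~ /soft/;    # no 'g' needed for a yes/no test
}

# With 'g' in scalar context, the second match starts where the first ended.
my $string = 'soft';
my $first  = $string =~ /soft/g ? 1 : 0;
my $second = $string =~ /soft/g ? 1 : 0;

print "files: $files_with_soft, first: $first, second: $second\n";
```

The second test fails even though the string plainly contains "soft", because the search resumed from the end of the previous match.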

There's another issue of data validation. You have, in multiple places, code like:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}

Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream: possibly difficult to debug.

If &getCorpus performs validation and guarantees the data it returns, you could just write:

my ($var) = $string =~ /$capture_regex/;

If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:

next unless $string =~ /$capture_regex/;
my $var = $1;

You may want to validate, report issues, then skip the remainder of the current iteration:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}
else {
    # ... issue warning, make a log entry, or whatever ...
    next;
}
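Put into a loop, that last pattern might look something like the following sketch. The @records data and the @skipped array (standing in for a warning or log entry) are inventions of mine for illustration.

```perl
use strict;
use warnings;

# Hypothetical records; the second has no date to capture.
my @records = (
    '<time datetime=2017-09-01 ...',
    'no date here',
    '<time datetime=2017-09-02 ...',
);

my (@dates, @skipped);
for my $record (@records) {
    my $date;
    if ($record =~ /datetime=(\d{4}-\d{2}-\d{2})/) {
        $date = $1;
    }
    else {
        push @skipped, $record;    # stand-in for a warning or log entry
        next;
    }
    push @dates, $date;            # only reached with a valid $date
}

print "dates: @dates\n";
print "skipped: ", scalar(@skipped), "\n";
```

Nothing after the if/else ever sees an undefined $date, so there is no uninitialised value to trip over further downstream.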

"... finds the word soft (or softest, softer etc.) ..."

Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":

$ grep soft /usr/share/dict/words | wc -l
30
$ grep ^soft /usr/share/dict/words | wc -l
22

Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:

/ \b (?: soft | softer | softest ) \b /ix
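As a quick comparison of the two approaches, here's a small sketch. The sample string and the variable names are mine; adjust the whitelist to whatever words you actually want.

```perl
use strict;
use warnings;

# Made-up sample text covering the cases discussed above.
my $text = 'soft softer softest softball semisoft Soft';

my $prefix_re    = qr{ \b soft }ix;
my $whitelist_re = qr{ \b (?: soft | softer | softest ) \b }ix;

# Counting idiom: match in list context, then count the resulting list.
my $prefix_count    = () = $text =~ /$prefix_re/g;
my $whitelist_count = () = $text =~ /$whitelist_re/g;

print "prefix: $prefix_count, whitelist: $whitelist_count\n";
```

The prefix pattern counts "softball" but not "semisoft"; the whitelist counts neither, only the three listed words in either case.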

Here's a minimal script that covers all the points I've raised.

#!/usr/bin/env perl

use strict;
use warnings;

my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
my $soft_re = qr{(?ix: \b ( soft ) )};

my %count_for_date;

for (values %{ get_corpus() }) {
    my ($date) = /$date_re/;
    my ($want) = /$want_re/;
    $count_for_date{$date}++ while $want =~ /$soft_re/g;
}

# For testing only:
use Data::Dump;
dd \%count_for_date;

sub get_corpus {
    my %corpus = (
        fileA => "<time datetime=2017-09-01...
            soft soft
            soft
            ##... soft soft ... soft ...##
            hard
            soft
        ",
        fileB => "<time datetime=2017-09-01...
            soft
            ##... hard ...##
            soft
        ",
        fileC => "<time datetime=2017-09-01...
            ##... softball ...##
        ",
        fileD => "<time datetime=2017-09-02...
            ##... semisoft ...##
        ",
        fileE => "<time datetime=2017-09-03...
            ##... soft softer softest softly soften softner ...##
        ",
        fileF => "<time datetime=2017-09-04...
            ##
                soft softer softest 
                Soft Softer Softest 
                softly soften softner 
                Softly Soften Softner 
            ##
        ",
    );

    return \%corpus;
}

Output:

{ "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }

If you also want dates with zero matches, you can add a line like this after you capture $date:

$count_for_date{$date} ||= 0;

The output then becomes:

{ "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }
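If you'd eventually like a plain report rather than a Data::Dump dump, something along these lines might do. This is only a sketch: @samples and @report are names I've made up, and the data is a cut-down stand-in for the corpus.

```perl
use strict;
use warnings;

my %count_for_date;

# Hypothetical [date, text] pairs standing in for the parsed corpus.
my @samples = (
    ['2017-09-01', 'soft soft'],
    ['2017-09-02', 'hard'],
    ['2017-09-01', 'softer'],
);

for my $sample (@samples) {
    my ($date, $want) = @$sample;
    $count_for_date{$date} ||= 0;    # make sure every date appears
    $count_for_date{$date}++ while $want =~ /\bsoft/ig;
}

# YYYY-MM-DD strings sort chronologically with a plain string sort.
my @report = map { "$_: $count_for_date{$_}" } sort keys %count_for_date;
print "$_\n" for @report;
```

Here the sort is worth paying for, because it runs once over the (much smaller) set of dates and not over the whole corpus.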

— Ken

In reply to Re: Counting instances of a string in certain sections of files within a hash by kcott
in thread Counting instances of a string in certain sections of files within a hash by Maire
