G'day Maire,

"Any guidance would be very much appreciated."

I'm going to assume that your "corpus" is the same, or similar, data to what we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":

I said:

"... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."

You replied:

"... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."

In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
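
To give a feel for what changes on the calling side, here's a minimal sketch. The sub name get_corpus_ref and the sample data are hypothetical stand-ins for illustration; they are not your real code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for &getCorpus: builds a small hash and
# returns a reference to it instead of copying every key-value pair.
sub get_corpus_ref {
    my %corpus = (
        fileA => 'first file text',
        fileB => 'second file text',
    );
    return \%corpus;
}

my $corpus = get_corpus_ref();

# Dereference to iterate: keys %$corpus instead of keys %corpus.
for my $file (sort keys %$corpus) {
    # Element access uses the arrow: $corpus->{$file}, not $corpus{$file}.
    print "$file: $corpus->{$file}\n";
}
```

Every place that currently touches %corpus directly would need the equivalent dereference, which is why the change gets more expensive the longer it's put off.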

Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:

There's another issue of data validation. You have, in multiple places, code like:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}

Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream: possibly difficult to debug.

If &getCorpus performs validation and guarantees the data it returns, you could just write:

my ($var) = $string =~ /$capture_regex/;

If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:

next unless $string =~ /$capture_regex/;
my $var = $1;

You may want to validate, report issues, then skip the remainder of the current iteration:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}
else {
    # ... issue warning, make a log entry, or whatever ...
    next;
}

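Here's a small runnable sketch of that last pattern inside a loop. The regex, the sample strings and the sub name collect_valid are all made up for illustration; substitute whatever fits your data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample data and regex, for illustration only.
my $capture_regex = qr{^id=(\d+)};
my @strings = ('id=42 ok', 'no id here', 'id=7 also ok');

# Collect captures; warn about, then skip, anything that fails to match.
sub collect_valid {
    my ($re, @input) = @_;
    my @captured;
    for my $string (@input) {
        my $var;
        if ($string =~ /$re/) {
            $var = $1;
        }
        else {
            warn "Skipping unmatched data: '$string'\n";
            next;
        }
        # From here on, $var is guaranteed to be defined.
        push @captured, $var;
    }
    return @captured;
}

print join(', ', collect_valid($capture_regex, @strings)), "\n";
```

This prints "42, 7" and sends one warning to STDERR for the string that didn't match.
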
"... finds the word soft (or softest, softer etc.) ..."

Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":

$ grep soft /usr/share/dict/words | wc -l
30
$ grep ^soft /usr/share/dict/words | wc -l
22

Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:

/ \b (?: soft | softer | softest ) \b /ix
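
As a quick check of how a whitelist like that behaves, here's a minimal sketch. The sub name count_whitelisted and the sample string are hypothetical; the regex is the one shown above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $whitelist_re = qr/ \b (?: soft | softer | softest ) \b /ix;

# Count whitelisted words in a string (hypothetical helper name).
sub count_whitelisted {
    my ($text) = @_;
    # The "= () =" idiom forces list context, then counts the matches.
    my $count = () = $text =~ /$whitelist_re/g;
    return $count;
}

my $sample = 'Soft softer softest softly softball semisoft';
print count_whitelisted($sample), "\n";
```

That prints 3: "Soft", "softer" and "softest" match, while "softly", "softball" and "semisoft" are rejected by the word boundaries.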

Here's a minimal script that covers all the points I've raised.

#!/usr/bin/env perl

use strict;
use warnings;

my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
my $soft_re = qr{(?ix: \b ( soft ) )};

my %count_for_date;

for (values %{ get_corpus() }) {
    my ($date) = /$date_re/;
    my ($want) = /$want_re/;
    $count_for_date{$date}++ while $want =~ /$soft_re/g;
}

# For testing only:
use Data::Dump;
dd \%count_for_date;

sub get_corpus {
    my %corpus = (
        fileA => "<time datetime=2017-09-01... soft soft soft ##... soft soft ... soft ...## hard soft ",
        fileB => "<time datetime=2017-09-01... soft ##... hard ...## soft ",
        fileC => "<time datetime=2017-09-01... ##... softball ...## ",
        fileD => "<time datetime=2017-09-02... ##... semisoft ...## ",
        fileE => "<time datetime=2017-09-03... ##... soft softer softest softly soften softner ...## ",
        fileF => "<time datetime=2017-09-04... ## soft softer softest Soft Softer Softest softly soften softner Softly Soften Softner ## ",
    );

    return \%corpus;
}

Output:

{ "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }

If you also want dates with zero matches, you can add a line like this after you capture $date:

$count_for_date{$date} ||= 0;

The output then becomes:

{ "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }

— Ken


In reply to Re: Counting instances of a string in certain sections of files within a hash by kcott
in thread Counting instances of a string in certain sections of files within a hash by Maire
