in reply to Counting instances of a string in certain sections of files within a hash

G'day Maire,

"Any guidance would be very much appreciated."

I'm going to assume that your "corpus" is the same, or similar, data to what we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":

I said:

"... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."

You replied:

"... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."

In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
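
To illustrate the difference, here is a minimal sketch. The subroutine names get_corpus_hash and get_corpus_ref, and the sample data, are made up for this example; they are not your real &getCorpus.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-ins for &getCorpus; names and data are illustrative only.
sub get_corpus_hash {
    my %corpus = (fileA => 'text a', fileB => 'text b');
    return %corpus;     # flattened to a list of keys and values
}

sub get_corpus_ref {
    my %corpus = (fileA => 'text a', fileB => 'text b');
    return \%corpus;    # a single scalar: a reference to the hash
}

# The returned list is copied, pair by pair, into a new hash.
my %copy = get_corpus_hash();

# Only the reference is returned; no key-value pairs are copied.
my $corpus = get_corpus_ref();

# Callers then dereference: $corpus->{$key}, keys %$corpus, values %$corpus.
printf "%s => %s\n", $_, $corpus->{$_} for sort keys %$corpus;
```

With millions of key-value pairs, avoiding that copy is the reason for the recommendation; the calling code changes from "$corpus{$key}" to "$corpus->{$key}" and from "keys %corpus" to "keys %$corpus".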

Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:

There's another issue of data validation. You have, in multiple places, code like:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}

Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream that may be difficult to debug.

If &getCorpus performs validation and guarantees the data it returns, you could just write:

my ($var) = $string =~ /$capture_regex/;
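
As a small sketch of how that list-context assignment behaves: the pattern and the two sample strings here are invented for illustration. When the match succeeds, $var receives the first capture; when it fails, the match returns an empty list and $var is left undefined, which is why this form is only safe with data that is guaranteed valid.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical pattern and sample strings, for illustration only.
my $capture_regex = qr{datetime=(\d{4}-\d{2}-\d{2})};

my @results;
for my $string ('<time datetime=2017-09-01...', 'no date here') {
    # List context: the match returns its captures, or an empty list on failure.
    my ($var) = $string =~ /$capture_regex/;
    push @results, $var;
    print defined $var ? "captured: $var\n" : "no match: \$var is undef\n";
}
```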

If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:

next unless $string =~ /$capture_regex/;
my $var = $1;

You may want to validate, report issues, then skip the remainder of the current iteration:

my $var;
if ($string =~ /$capture_regex/) {
    $var = $1;
}
else {
    # ... issue warning, make a log entry, or whatever ...
    next;
}

"... finds the word soft (or softest, softer etc.) ..."

Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":

$ grep soft /usr/share/dict/words | wc -l
30
$ grep ^soft /usr/share/dict/words | wc -l
22

Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:

/ \b (?: soft | softer | softest ) \b /ix
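
A quick way to compare the two approaches is to count matches of each pattern against the same text. This is only a sketch: the sample text is invented, and the variable names are mine.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Prefix match: anything starting with "soft" at a word boundary.
my $prefix_re = qr{(?ix: \b soft )};

# Whitelist: only these exact words, bounded on both sides.
my $whitelist_re = qr{(?ix: \b (?: soft | softer | softest ) \b )};

# Invented sample text, for illustration only.
my $text = 'soft Softer softest softly softball semisoft';

# The "= () =" idiom forces list context and counts the matches.
my $prefix_count    = () = $text =~ /$prefix_re/g;
my $whitelist_count = () = $text =~ /$whitelist_re/g;

print "prefix: $prefix_count, whitelist: $whitelist_count\n";
```

The prefix pattern also counts "softly" and "softball"; the whitelist counts only "soft", "Softer" and "softest". Neither matches "semisoft", because there is no word boundary before its "soft".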

Here's a minimal script that covers all the points I've raised.

#!/usr/bin/env perl

use strict;
use warnings;

my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
my $soft_re = qr{(?ix: \b ( soft ) )};

my %count_for_date;

for (values %{ get_corpus() }) {
    my ($date) = /$date_re/;
    my ($want) = /$want_re/;
    $count_for_date{$date}++ while $want =~ /$soft_re/g;
}

# For testing only:
use Data::Dump;
dd \%count_for_date;

sub get_corpus {
    my %corpus = (
        fileA => "<time datetime=2017-09-01... soft soft soft ##... soft soft ... soft ...## hard soft ",
        fileB => "<time datetime=2017-09-01... soft ##... hard ...## soft ",
        fileC => "<time datetime=2017-09-01... ##... softball ...## ",
        fileD => "<time datetime=2017-09-02... ##... semisoft ...## ",
        fileE => "<time datetime=2017-09-03... ##... soft softer softest softly soften softner ...## ",
        fileF => "<time datetime=2017-09-04... ## soft softer softest Soft Softer Softest softly soften softner Softly Soften Softner ## ",
    );

    return \%corpus;
}

Output:

{ "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }

If you also wanted dates with zero matches, you can add a line like this after you capture $date:

$count_for_date{$date} ||= 0;

The output then becomes:

{ "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }

— Ken

Re^2: Counting instances of a string in certain sections of files within a hash
by Maire (Scribe) on Nov 01, 2017 at 10:26 UTC

    Wow, thank you so much for all of this and especially for your very clear explanations of the changes/improvements that you have made to the original script.

    Your assumptions that the data is the same as that from our previous discussion and also that I will be performing numerous tasks using this data are correct. I've now implemented the "return \%corpus;" change across the scripts that use that function, thanks!

    Unfortunately, I haven't yet been able to get this improved script to produce any output when I change the hash from the example one to the hash created by my getCorpus function. However, as I speculated in my reply to hippo above, I suspect that this may be something to do with the function itself. I'm going to run a few tests on small folders of data, to see if I can spot the exact problem.

    Thanks again for all of this!

    UPDATE: Ah, the script wasn't producing any output with my function because I was using incorrect syntax to print! I've now got output but, unfortunately, the count values are still lower than they should be. Interestingly, however, they are reporting the same frequencies as the version of this script that hippo helped me with:

    { "2017-09-04" => 1 }
    not ok 1 - 2017-09-04 tally correct
    #   Failed test '2017-09-04 tally correct'
    #   at C:\Users\lisad\PhD\perl\test11.pl line 62.
    #          got: '1'
    #     expected: '3'
    not ok 2 - 2017-09-30 tally correct
    #   Failed test '2017-09-30 tally correct'
    #   at C:\Users\lisad\PhD\perl\test11.pl line 63.
    #          got: undef
    #     expected: '2'
    # Looks like you failed 2 tests of 2.