G'day Maire,
"Any guidance would be very much appreciated."
I'm going to assume that your "corpus" is the same, or similar, data to what we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":
I said:
"... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."
You replied:
"... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."
In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
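To make the difference concrete, here is a minimal sketch of returning and using a hashref. The sub name get_corpus_ref and the two sample entries are made up for illustration; your real &getCorpus will obviously differ.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical stand-in for &getCorpus (name and data are assumptions).
# Returning a reference avoids copying the whole hash on return.
sub get_corpus_ref {
    my %corpus = (fileA => 'text A', fileB => 'text B');
    return \%corpus;
}

my $corpus = get_corpus_ref();

# Callers dereference: $corpus->{key} for a single value,
# %$corpus (or %{ $corpus }) for the whole hash.
my $file_count = keys %$corpus;

print "files: $file_count\n";
print "fileA: $corpus->{fileA}\n";
```

Every place that currently uses %corpus or $corpus{key} would need the equivalent %$corpus or $corpus->{key}, which is why making the change early is cheaper.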
Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary; these could noticeably slow your application. Here are some suggestions for things to change:
There's another issue of data validation. You have, in multiple places, code like:
    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }
Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream that are difficult to debug.
If &getCorpus performs validation and guarantees the data it returns, you could just write:
my ($var) = $string =~ /$capture_regex/;
If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:
    next unless $string =~ /$capture_regex/;
    my $var = $1;
You may want to validate, report issues, then skip the remainder of the current iteration:
    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }
    else {
        # ... issue warning, make a log entry, or whatever ...
        next;
    }
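For what it's worth, here is a small runnable sketch of that last pattern inside a loop. The regex and the three input strings are invented purely for the demonstration; substitute your own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical pattern and data, for illustration only.
my $capture_regex = qr/^id=(\d+)/;
my @strings       = ('id=1 ok', 'garbage', 'id=22 ok');

my @ids;

for my $string (@strings) {
    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }
    else {
        # Report the problem, then skip the rest of this iteration.
        warn "Skipping unparsable record: '$string'\n";
        next;
    }

    # Only reached with a defined $var.
    push @ids, $var;
}

print "@ids\n";
```

The invalid record produces a warning on STDERR and never reaches the code that relies on $var.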
"... finds the word soft (or softest, softer etc.) ..."
Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":
    $ grep soft /usr/share/dict/words | wc -l
    30
    $ grep ^soft /usr/share/dict/words | wc -l
    22
Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:
/ \b (?: soft | softer | softest ) \b /ix
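As a quick check of what such a whitelist accepts, here is a short sketch that runs the regex against a handful of words. The word list is just a sample I picked; it is not taken from your data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $whitelist_re = qr/ \b (?: soft | softer | softest ) \b /ix;

# Sample words chosen for the demonstration only.
my @words   = qw(soft Softer softest softball semisoft softly);
my @matched = grep { /$whitelist_re/ } @words;

print "@matched\n";
```

The trailing \b rejects "softball" and "softly"; the leading \b rejects "semisoft"; the /i modifier lets "Softer" through.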
Here's a minimal script that covers all the points I've raised.
    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
    my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
    my $soft_re = qr{(?ix: \b ( soft ) )};

    my %count_for_date;

    for (values %{ get_corpus() }) {
        my ($date) = /$date_re/;
        my ($want) = /$want_re/;
        $count_for_date{$date}++ while $want =~ /$soft_re/g;
    }

    # For testing only:
    use Data::Dump;
    dd \%count_for_date;

    sub get_corpus {
        my %corpus = (
            fileA => "<time datetime=2017-09-01... soft soft soft ##... soft soft ... soft ...## hard soft ",
            fileB => "<time datetime=2017-09-01... soft ##... hard ...## soft ",
            fileC => "<time datetime=2017-09-01... ##... softball ...## ",
            fileD => "<time datetime=2017-09-02... ##... semisoft ...## ",
            fileE => "<time datetime=2017-09-03... ##... soft softer softest softly soften softner ...## ",
            fileF => "<time datetime=2017-09-04... ## soft softer softest Soft Softer Softest softly soften softner Softly Soften Softner ## ",
        );

        return \%corpus;
    }
Output:
{ "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }
If you also wanted dates with zero matches, you can add a line like this after you capture $date:
$count_for_date{$date} ||= 0;
The output then becomes:
{ "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }
— Ken
In reply to Re: Counting instances of a string in certain sections of files within a hash
by kcott
in thread Counting instances of a string in certain sections of files within a hash
by Maire