Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm trying to write a script which returns the frequency of a given word (per day) in certain extracts of a very large number of files from a single folder.

So, I have text files which begin with a timestamp and contain a lot of irrelevant metadata, but all of the files have text that I'm interested in enclosed between two hashtags.

e.g
<time datetime=2017-09-03T23:17:53Z> Irrelevant text. Irrelevent text. More text. ##This is the data that I am interested in.## More text. More text.

I have a script which, when run, returns the frequency of a given word per day in all of the text files from a given folder:

So, this script sorts through a hash (compiled using a function which is not reproduced here because it is not publicly available). I first get the datestamps (which then serve as the keys for a new hash) and then I say that the count value should be increased each time the script finds the word soft (or softest, softer etc.) being used. I then print out the datestamp and the frequency of usage of soft etc. per day.

my %mycorpus = getCorpus('C:\Users\li\test4');
my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}
foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

The OUTPUT is:
2017-08-31 = 42
2017-09-01 = 25
2017-09-02 = 40
2017-09-03 = 34
2017-09-04 = 26
.....

This script works as expected and the values/counts that are printed are accurate (i.e. "soft" WAS used 42 times on 31/08/2017). However, this script returns the frequency of "soft" in all of the text, and I'm only interested in how many times it occurs within the "hashtagged" sections. I have attempted to return only the frequency of the word in the hashtagged section using the following script:

my %mycorpus = getCorpus('C:\Users\li\test4');
my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
        $hashtags = $1;
    }
    if ($hashtags =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}
foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

After a bit of messing about, I have managed to get it to produce output, but the count values are a lot lower than they should be:

2017-09-01 = 1
2017-09-02 = 3
2017-09-03 = 6
2017-09-04 = 2
2017-09-05 = 1
2017-09-06 = 3
2017-09-07 = 3

For instance, on 2017-09-04, there are at least ten instances of "soft" in the text matched by the regex, but only "2" have been counted.

I've rewritten this script dozens of times, but I can't seem to spot my mistake(s). Any guidance would be very much appreciated.

Re: Counting instances of a string in certain sections of files within a hash
by hippo (Archbishop) on Oct 31, 2017 at 17:03 UTC
    I've rewritten this script dozens of times, but I can't seem to spot my mistake

    There's no SSCCE in there. Creating one shows the mistake quite clearly:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my %mycorpus = (
        a => "<time datetime=2017-09-03T23:17:53Z> blah blah ##soft and softly is as softly does## bar",
        b => "<time datetime=2017-09-03T23:17:53Z> blah blah ##Not so SOFT now, eh?## foo",
        c => "<time datetime=2017-09-04T23:17:53Z> blah ##Mr. Soft in the soft-play area## baz"
    );
    my %counts;
    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $hashtags = '';
        if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
            $date = $1;
        }
        if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
            $hashtags = $1;
        }
        if ($hashtags =~ /\bsoft/gi) {
            $counts{$date}++;
        }
    }

    is ($counts{'2017-09-03'}, 4, "2017-09-03 tally correct");
    is ($counts{'2017-09-04'}, 2, "2017-09-04 tally correct");

    Your assignment to $counts{$date} is wrong - it only adds one count irrespective of how many matches there are in that line/file. Here's the fixed version:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my %mycorpus = (
        a => "<time datetime=2017-09-03T23:17:53Z> blah blah ##soft and softly is as softly does## bar",
        b => "<time datetime=2017-09-03T23:17:53Z> blah blah ##Not so SOFT now, eh?## foo",
        c => "<time datetime=2017-09-04T23:17:53Z> blah ##Mr. Soft in the soft-play area## baz"
    );
    my %counts;
    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $hashtags = '';
        if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
            $date = $1;
        }
        if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
            $hashtags = $1;
        }
        if (my $matches =()= $hashtags =~ /\bsoft/gi) {
            $counts{$date} += $matches;
        }
    }

    is ($counts{'2017-09-03'}, 4, "2017-09-03 tally correct");
    is ($counts{'2017-09-04'}, 2, "2017-09-04 tally correct");
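
The difference between the two versions can be seen in isolation. The following is a minimal sketch, not part of hippo's reply: count_matches is a hypothetical helper name and the sample string is taken from the test data above. A match in boolean context only reports whether there is at least one hit, while assigning the list of /g matches through an empty list gives the number of hits.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count every match of $re in $text.
sub count_matches {
    my ($text, $re) = @_;
    # In list context, /g returns all matches; assigning that list to the
    # empty list () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /$re/g;
    return $count;
}

my $text = 'soft and softly is as softly does';

# A boolean test only says whether there is at least one match.
my $found = $text =~ /\bsoft/i ? 1 : 0;

# The counting idiom reports how many matches there are.
my $total = count_matches($text, qr/\bsoft/i);

print "found=$found total=$total\n";
```

Here the boolean test can only ever add one to a tally per string, whereas the counting idiom sees all three occurrences.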

      Hi Hippo,

      This is incredibly helpful, thank you! I've tried substituting your data with a few extracts of my own in the 'in-script' example, and it worked brilliantly, passed the tests, and printed out exactly what I expected when I added that function to the script.

      Unfortunately, the problems arise when I try to use the script with the hash created by my getCorpus function, as illustrated by the output here:

      1..2
      2017-09-04 = 1
      not ok 1 - 2017-09-04 tally correct
      #   Failed test '2017-09-04 tally correct'
      #   at C:\Users\li\test9.pl line 67.
      #          got: '1'
      #     expected: '3'
      not ok 2 - 2017-09-30 tally correct
      #   Failed test '2017-09-30 tally correct'
      #   at C:\Users\li\test9.pl line 68.
      #          got: undef
      #     expected: '2'
      # Looks like you failed 2 tests of 2.

      Given that it didn't count either of the two instances of the word in my 2017-09-30 documents, it looks like my problems may run deeper than I first believed. I think I might need to go back and check that the function is actually getting everything that I expect it to from the given folders!

      Thanks again for your help though and for illustrating the use of the "Test" module: I think that might come in useful!

      UPDATE: It appears that the function is actually working okay, after all! I've used Data::Dumper to print out the entire contents of the hash discussed above, and it prints out everything that I expected it to, suggesting that the problem lies elsewhere. I'm going to double check the regex next and make sure that it always captures what I expect it to.

      UPDATE2: Okay, it looks like the problem may be something to do with the regex. I put aside trying to count the words and instead just tried to print out the content in the hash that matched my regex. I ran the following script (I know that this is probably a horribly complex way to do this, but the script was already in my index catalogue!):

      use Data::Dumper;

      open (OUT1, ">alldata.txt") or die;
      %mycorpus = getCorpus('C:\Users\li\test11');
      my $href = \%mycorpus;    # reference to the hash %mycorpus
      print OUT1 Dumper $href;  # note the order will differ from that above
      print OUT1 "\n";

      open (OUT2, ">editeddata.txt") or die;
      open( FILE, "alldata.txt" ) || die "couldn't open\n";
      $/ = undef;
      while (<FILE>) {
          while (/(?<=##)(.*)(?=##)/g) {
              print OUT2 "$1\n";
          }
      }

      So the first part of the code prints out the entire contents of the hash into a text file and the second part opens that file and prints out (into a second file) only the lines that match the regular expression. In total, 35 lines should have been printed into the second file, but instead, only 22 were.

      There is nothing that distinguishes the missed lines from the captured ones: they are all formatted in exactly the same way and all are captured when the regex is run on e.g. https://regex101.com/. Moreover, if I run a very simple programme below that prints out the regex matches from a single file, the "missed" lines are captured:

      open(FILE, 'C:\Users\li\test11\164949.txt');
      while (<FILE>) {
          if ( /(?<=##)(.*)(?=##)/g ) {
              print "$1\n";
          }
      }

      In other words, the lines that are not being captured do match the regex and should be being printed. It looks to me like this is probably the root of my problems, and, if I fix this, then it should start counting the frequency of my words properly!

        $/ = undef;

        Can you explain why you are doing this? I don't see the need and it might complicate matters. I also don't immediately see the need for the look-arounds in the regex. Perhaps if you could provide an SSCCE with a small example of the failing data (a couple of lines should suffice) we might be able to suggest something.
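
One general way slurping can interact with this regex is sketched below, on made-up text. Nothing in the thread establishes that this is what happens with Maire's data. Once a whole file sits in a single string, "." still stops at each newline unless the /s modifier is given, so a hashtagged section containing a line break is not captured.

```perl
use strict;
use warnings;

# Made-up text in which the hashtagged section contains a line break.
my $text = "<time datetime=2017-09-04T23:17:53Z> meta ##soft start\nsoft end## more";

# Without /s, "." does not match "\n", so no section is found.
my ($without_s) = $text =~ /##(.*?)##/;

# With /s, "." also matches "\n" and the whole section is captured.
my ($with_s) = $text =~ /##(.*?)##/s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: matched\n"    : "with /s: no match\n";
```

Reading line by line has the same limitation, because each line is then matched on its own.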

Re: Counting instances of a string in certain sections of files within a hash
by kcott (Archbishop) on Oct 31, 2017 at 21:33 UTC

    G'day Maire,

    "Any guidance would be very much appreciated."

    I'm going to assume that your "corpus" is the same, or similar, data to what we discussed in "Re: Storing output of a subroutine into an hash and then printing hash":

    I said:

    "... it sounds like there's a lot of data, and %corpus may contain hundreds (thousands? millions?) of key-value pairs. ... I would recommend that you at least consider returning a hashref from &getCorpus, instead of a hash."

    You replied:

    "... eventually I will be working with millions of key-value pairs, so I will attempt to use the hashref method."

    In the example code below, I have used a hashref. I do recommend that you implement that sooner, rather than later. Changing &getCorpus is ridiculously simple and only involves adding one character: "return %corpus;" → "return \%corpus;". Changing all the code that uses that will likely be a lot more work. If you're writing a number of utilities to perform different tasks with this "corpus" data — I'm getting that impression but don't really know — multiply "a lot more work" by "number of utilities".
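
As a rough sketch of what that change means for calling code: get_corpus_ref below is a hypothetical stand-in with made-up data, since the real getCorpus is not shown in the thread. The caller receives a single scalar and dereferences it wherever the hash itself is needed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real getCorpus, which is not shown in the thread.
sub get_corpus_ref {
    my %corpus = (
        'a.txt' => '<time datetime=2017-09-03T23:17:53Z> ##soft##',
        'b.txt' => '<time datetime=2017-09-04T08:00:00Z> ##hard##',
    );
    return \%corpus;    # one scalar (a reference), not a flattened key/value list
}

my $corpus = get_corpus_ref();

# Dereference to count the entries, look up one value, or iterate.
my $n_files = keys %$corpus;
print "files: $n_files\n";
print "a.txt: $corpus->{'a.txt'}\n";
for my $text (values %$corpus) {
    print length($text), "\n";
}
```

Returning the reference avoids copying every key and value into a new hash in the caller, which matters more as the corpus grows.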

    Bearing in mind the size of the data, you should be aiming to streamline your code. You are currently performing a lot of tasks that are quite unnecessary: these could noticeably slow your application. Here are some suggestions for things to change:

    • You have "foreach my $filename (sort keys %mycorpus) { ... }".
      • Your code indicates you have no interest in the order of the keys: remove sort.
      • Your code indicates you have no interest in the keys themselves. You only use them to access the values ($mycorpus{$filename}) in a couple of places. Skip over all that unnecessary processing by changing keys to values: now you can iterate just the data you want to work with.
    • You have if conditions using regexes with the 'g' modifier. In this particular context, that modifier is pointless: remove it. See "perlre: Modifiers" for more about that.
    • You have made liberal use of lookaround assertions (see "perlre: Lookaround Assertions"). These are all extra work for the regex engine and none are actually required here. See the regexes in my example script below which doesn't use these assertions.
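
To illustrate the lookaround point, here is a small sketch on made-up text (not Maire's data) with two hashtagged sections in one string. It compares the original greedy lookaround regex, a non-greedy lookaround variant, and a regex that simply consumes the ## delimiters.

```perl
use strict;
use warnings;

# Made-up sample text with two hashtagged sections on one line.
my $text = 'x ##A## mid ##B## y';

# Greedy .* with lookarounds: runs from the first ## to the last ##.
my @greedy = $text =~ /(?<=##)(.*)(?=##)/g;

# Non-greedy with lookarounds: the closing ## of one section is not consumed,
# so it also satisfies the lookbehind and the text between sections matches.
my @lazy_lookaround = $text =~ /(?<=##)(.*?)(?=##)/g;

# Consuming both delimiters, non-greedy: one capture per section.
my @consuming = $text =~ /##(.*?)##/g;

print join('|', @greedy), "\n";
print join('|', @lazy_lookaround), "\n";
print join('|', @consuming), "\n";
```

Only the last form returns exactly the two sections, which is one reason to prefer matching the delimiters directly.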

    There's another issue of data validation. You have, in multiple places, code like:

    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }

    Processing now continues with a potentially uninitialised $var. This could easily cause problems downstream: possibly difficult to debug.

    If &getCorpus performs validation and guarantees the data it returns, you could just write:

    my ($var) = $string =~ /$capture_regex/;

    If you do need to validate it yourself, but simply want to skip invalid data, don't continue processing. Instead, you can do something along these lines:

    next unless $string =~ /$capture_regex/;
    my $var = $1;

    You may want to validate, report issues, then skip the remainder of the current iteration:

    my $var;
    if ($string =~ /$capture_regex/) {
        $var = $1;
    }
    else {
        # ... issue warning, make a log entry, or whatever ...
        next;
    }

    "... finds the word soft (or softest, softer etc.) ..."

    Whether that's a word you want to find in your real data, or just an example for test purposes, you may want to consider if "/\bsoft/" is sufficient. My system's dictionary has 30 words containing "soft", 22 of which start with "soft":

    $ grep soft /usr/share/dict/words | wc -l
    30
    $ grep ^soft /usr/share/dict/words | wc -l
    22

    Do you want to exclude words like "semisoft"? Do you want to include words like "softball"? You may want to create a whitelist so that you know exactly what you're matching; perhaps something along these lines:

    / \b (?: soft | softer | softest ) \b /ix
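
A short sketch of how such a whitelist changes the tally compared with the plain prefix match. The sample string is made up for illustration and the variable names are not from the thread.

```perl
use strict;
use warnings;

# The whitelist from above, and the looser prefix match used in the original script.
my $whitelist_re = qr/ \b (?: soft | softer | softest ) \b /ix;
my $prefix_re    = qr/ \b soft /ix;

# Made-up sample text.
my $text = 'Soft softer softest softly softball semisoft';

# Count all matches of each pattern.
my $whitelist_count = () = $text =~ /$whitelist_re/g;
my $prefix_count    = () = $text =~ /$prefix_re/g;

print "whitelist=$whitelist_count prefix=$prefix_count\n";
```

The whitelist counts only "Soft", "softer" and "softest", while the prefix match also counts "softly" and "softball"; neither counts "semisoft".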

    Here's a minimal script that covers all the points I've raised.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $date_re = qr{(?x: ^ <time \s datetime= ( \d{4} - \d{2} - \d{2} ) )};
    my $want_re = qr{(?sx: [#][#] ( .*? ) [#][#] )};
    my $soft_re = qr{(?ix: \b ( soft ) )};

    my %count_for_date;

    for (values %{ get_corpus() }) {
        my ($date) = /$date_re/;
        my ($want) = /$want_re/;
        $count_for_date{$date}++ while $want =~ /$soft_re/g;
    }

    # For testing only:
    use Data::Dump;
    dd \%count_for_date;

    sub get_corpus {
        my %corpus = (
            fileA => "<time datetime=2017-09-01... soft soft soft ##... soft soft ... soft ...## hard soft ",
            fileB => "<time datetime=2017-09-01... soft ##... hard ...## soft ",
            fileC => "<time datetime=2017-09-01... ##... softball ...## ",
            fileD => "<time datetime=2017-09-02... ##... semisoft ...## ",
            fileE => "<time datetime=2017-09-03... ##... soft softer softest softly soften softner ...## ",
            fileF => "<time datetime=2017-09-04... ## soft softer softest Soft Softer Softest softly soften softner Softly Soften Softner ## ",
        );
        return \%corpus;
    }

    Output:

    { "2017-09-01" => 4, "2017-09-03" => 6, "2017-09-04" => 12 }

    If you also wanted dates with zero matches, you can add a line like this after you capture $date:

    $count_for_date{$date} ||= 0;

    The output then becomes:

    { "2017-09-01" => 4, "2017-09-02" => 0, "2017-09-03" => 6, "2017-09-04" => 12 }

    — Ken

      Wow, thank you so much for all of this and especially for your very clear explanations of the changes/improvements that you have made to the original script.

      Your assumptions that the data is the same as that from our previous discussion and also that I will be performing numerous tasks using this data are correct. I've now implemented the "return \%corpus;" change across the scripts that use that function, thanks!

      Unfortunately, I haven't yet been able to get this improved script to produce any output when I change the hash from the example one to the hash created by my getCorpus function. However, as I speculated in my reply to hippo above, I suspect that this may be something to do with the function itself. I'm going to run a few tests on small folders of data, to see if I can spot the exact problem.

      Thanks again for all of this!

      UPDATE: Ah, the script wasn't producing any output with my function because I was using incorrect syntax to print! I've now got output but, unfortunately, the count values are still lower than they should be. Interestingly, however, they are reporting the same frequencies as the version of this script that hippo helped me with:

      { "2017-09-04" => 1 }
      not ok 1 - 2017-09-04 tally correct
      #   Failed test '2017-09-04 tally correct'
      #   at C:\Users\lisad\PhD\perl\test11.pl line 62.
      #          got: '1'
      #     expected: '3'
      not ok 2 - 2017-09-30 tally correct
      #   Failed test '2017-09-30 tally correct'
      #   at C:\Users\lisad\PhD\perl\test11.pl line 63.
      #          got: undef
      #     expected: '2'
      # Looks like you failed 2 tests of 2.