in reply to Counting instances of a string in certain sections of files within a hash

I've rewritten this script dozens of times, but I can't seem to spot my mistake

There's no SSCCE in there. Creating one shows the mistake quite clearly:

use strict;
use warnings;
use Test::More tests => 2;

my %mycorpus = (
    a => "<time datetime=2017-09-03T23:17:53Z> blah blah ##soft and softly is as softly does## bar",
    b => "<time datetime=2017-09-03T23:17:53Z> blah blah ##Not so SOFT now, eh?## foo",
    c => "<time datetime=2017-09-04T23:17:53Z> blah ##Mr. Soft in the soft-play area## baz"
);

my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $hashtags = '';
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g){
        $hashtags = $1;
    }
    if ($hashtags =~ /\bsoft/gi){
        $counts{$date}++;
    }
}

is ($counts{'2017-09-03'}, 4, "2017-09-03 tally correct");
is ($counts{'2017-09-04'}, 2, "2017-09-04 tally correct");

Your assignment to $counts{$date} is wrong - it only adds one count irrespective of how many matches there are in that line/file. Here's the fixed version:

use strict;
use warnings;
use Test::More tests => 2;

my %mycorpus = (
    a => "<time datetime=2017-09-03T23:17:53Z> blah blah ##soft and softly is as softly does## bar",
    b => "<time datetime=2017-09-03T23:17:53Z> blah blah ##Not so SOFT now, eh?## foo",
    c => "<time datetime=2017-09-04T23:17:53Z> blah ##Mr. Soft in the soft-play area## baz"
);

my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $hashtags = '';
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g){
        $hashtags = $1;
    }
    if (my $matches =()= $hashtags =~ /\bsoft/gi){
        $counts{$date} += $matches;
    }
}

is ($counts{'2017-09-03'}, 4, "2017-09-03 tally correct");
is ($counts{'2017-09-04'}, 2, "2017-09-04 tally correct");
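For anyone unfamiliar with the `=()=` construct used in the fix, here is a minimal sketch of the difference between testing a match and counting matches. The variable names `$found` and `$count` are only illustrative and do not come from the scripts above.

```perl
use strict;
use warnings;

my $hashtags = 'soft and softly is as softly does';

# Boolean context: the match succeeds on the first hit,
# so an increment guarded by it can only ever add one.
my $found = ($hashtags =~ /\bsoft/i) ? 1 : 0;

# Assigning the match to an empty list forces list context, so /g
# returns every match; the outer scalar assignment then yields the
# number of elements produced on the right-hand side.
my $count = () = $hashtags =~ /\bsoft/gi;

print "found=$found count=$count\n";   # prints "found=1 count=3"
```

The same count could be obtained with an explicit `while (/\bsoft/gi) { ... }` loop; the empty-list assignment is just the shorter spelling.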

Re^2: Counting instances of a string in certain sections of files within a hash
by Maire (Scribe) on Nov 01, 2017 at 10:11 UTC

    Hi Hippo,

    This is incredibly helpful, thank you! I've tried substituting your data with a few extracts of my own in the 'in-script' example, and it worked brilliantly, passed the tests, and printed out exactly what I expected when I added that function to the script.

    Unfortunately, the problems arise when I try to use the script with the hash created by my getCorpus function, as illustrated by the output here:

    1..2
    2017-09-04 = 1
    not ok 1 - 2017-09-04 tally correct
    #   Failed test '2017-09-04 tally correct'
    #   at C:\Users\li\test9.pl line 67.
    #          got: '1'
    #     expected: '3'
    not ok 2 - 2017-09-30 tally correct
    #   Failed test '2017-09-30 tally correct'
    #   at C:\Users\li\test9.pl line 68.
    #          got: undef
    #     expected: '2'
    # Looks like you failed 2 tests of 2.

    It looks like, given that it didn't even count any of the two instances of the word in my 2017-09-30 documents, my problems may run deeper than I first believed. I think I might need to go back and check that the function is actually getting everything that I expect it to from the given folders!

    Thanks again for your help though and for illustrating the use of the "Test" module: I think that might come in useful!

    UPDATE: It appears that the function is actually working okay, after all! I've used Data::Dumper to print out the entire contents of the hash discussed above, and it prints out everything that I expected it to, suggesting that the problem lies elsewhere. I'm going to double check the regex next and make sure that it always captures what I expect it to.

    UPDATE2: Okay, it looks like the problem may be something to do with the regex. I put aside trying to count the words and instead just tried to print out the content in the hash that matched my regex. I ran the following script (I know that this is probably a horribly complex way to do this, but the script was already in my index catalogue!):

    use Data::Dumper;
    open (OUT1, ">alldata.txt") or die;
    %mycorpus = getCorpus('C:\Users\li\test11');
    my $href = \%mycorpus; # reference to the hash %mycorpus
    print OUT1 Dumper $href; # note the order will differ from that above
    print OUT1 "\n";
    open (OUT2, ">editeddata.txt") or die;
    open( FILE, "alldata.txt" ) || die "couldn't open\n";
    $/ = undef;
    while (<FILE>) {
        while (/(?<=##)(.*)(?=##)/g) {
            print OUT2 "$1\n";
        }
    }

    So the first part of the code prints out the entire contents of the hash into a text file and the second part opens that file and prints out (into a second file) only the lines that match the regular expression. In total, 35 lines should have been printed into the second file, but instead, only 22 were.

    There is nothing that distinguishes the missed lines from the captured ones: they are all formatted in exactly the same way and all are captured when the regex is run on e.g. https://regex101.com/. Moreover, if I run a very simple programme below that prints out the regex matches from a single file, the "missed" lines are captured:

    open(FILE, 'C:\Users\li\test11\164949.txt');
    while (<FILE>) {
        if ( /(?<=##)(.*)(?=##)/g ) {
            print "$1\n";
        }
    }

    In other words, the lines that are not being captured do match the regex and should be being printed. It looks to me like this is probably the root of my problems, and, if I fix this, then it should start counting the frequency of my words properly!

      $/ = undef;

      Can you explain why you are doing this? I don't see the need and it might complicate matters. I also don't immediately see the need for the look-arounds in the regex. Perhaps if you could provide an SSCCE with a small example of the failing data (a couple of lines should suffice) we might be able to suggest something.
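For context on the quoted line: setting `$/` to undef switches `<FILE>` from line-by-line reads to reading the whole file as one record, and a bare assignment changes this for every later read in the program. The sketch below uses an in-memory filehandle and made-up data (both are assumptions for illustration) and confines the change with `local`.

```perl
use strict;
use warnings;

# Made-up two-line sample standing in for a real file.
my $text = "##first tag##\n##second tag##\n";

# Default record separator: one record per line.
open my $by_line, '<', \$text or die "Cannot open in-memory file: $!";
my @lines = <$by_line>;
close $by_line;

# Slurp mode, limited to this block by local().
my $whole;
{
    local $/;    # undef inside this block only
    open my $slurp, '<', \$text or die "Cannot open in-memory file: $!";
    $whole = <$slurp>;
    close $slurp;
}

# Without /s, '.' does not match a newline, so each match stays on one line.
my @tags = $whole =~ /##(.*)##/g;

print scalar(@lines), " records by line\n";
print scalar(@tags), " tags in the slurped string\n";
```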

        Thank you, I really appreciate your help with this!

        I was making a mistake with the code that I demonstrated in UPDATE 2 above. The second part of the script was (I think) attempting to print out the regex matches before all of the data had been "dumped" out of the hash and into alldata.txt. When I ran the second part of the code separate from the first, it successfully matched all of the data it was supposed to, demonstrating (I think) that the regex is not the problem here, either. Sorry for wasting your time with that: I should have double checked that my code was right before posting!
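The behaviour described here is consistent with output buffering: data printed to a filehandle may not reach the disk until the handle is closed or flushed. A minimal sketch of writing and then reading back within one script follows; it uses a temporary file from File::Temp as a hypothetical stand-in for alldata.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for alldata.txt.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "##soft##\n" for 1 .. 3;

# Closing the write handle flushes Perl's output buffer to disk,
# so the read below sees everything that was printed.
close $out or die "Cannot close $path: $!";

open my $in, '<', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in;

print scalar(@lines), " lines read back\n";   # prints "3 lines read back"
```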

        I am, however, still having trouble getting the main code to count the instances of "soft" per day. I'm using the corrections that you very kindly made to my original script -- the only things that I've changed are that I've substituted your examples for my own data and I've also taken the lookarounds out of the regex, following your and kcott's advice:

        use strict;
        use warnings;
        use Test::More tests => 2;

        my %mycorpus = (
            a => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time>
        Irrelevant text that may feature the word soft, softest, or softly.
        ar##*whispers softly* don\'t##
        ##very soft##
        ##the softest even##
        — 164 notes",
            b => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time>
        Irrelevant text that may feature the word soft, softest, or softly.
        4r##skam##
        rr##isak valtersen##
        rr##even bech næsheim##
        dr##god##
        r##they're so soft##
        sr##my heart is bursting##
        ##This is the softest##
        — 379 notes
        Irrelevant text that may feature the word soft, softest, or softly.",
            c => "<p><time datetime=2017-09-04T05:27:03Z>04/09/17 06:27:03</time>
        ##SKSNSKXBXKXND##
        r##I LOVE THESE##
        ##such soft boyfriend™##
        ##you're my sunshine##
        — 180 notes
        Irrelevant text that may feature the word soft, softest, or softly."
        );

        my %counts;
        foreach my $filename (sort keys %mycorpus) {
            my $date;
            my $hashtags = '';
            if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
                $date = $1;
            }
            if ($mycorpus{$filename} =~ /[#][#](.*)[#][#]/g){
                $hashtags = $1;
            }
            if (my $matches =()= $hashtags =~ /\bsoft/gi){
                $counts{$date} += $matches;
            }
        }

        is ($counts{'2017-09-04'}, 4, "2017-09-04 tally correct");
        is ($counts{'2017-09-30'}, 2, "2017-09-30 tally correct");

        This script produces the following output:
        1..2
        not ok 1 - 2017-09-03 tally correct
        #   Failed test '2017-09-03 tally correct'
        #   at C:\Users\li\test18.pl line 52.
        #          got: undef
        #     expected: '4'
        not ok 2 - 2017-09-04 tally correct
        #   Failed test '2017-09-04 tally correct'
        #   at C:\Users\li\test18.pl line 53.
        #          got: '1'
        #     expected: '2'
        # Looks like you failed 2 tests of 2.

        If it makes any difference, I think that it is the very first instance of "soft" (in the line "ar##*whispers softly* don\'t##") that it actually captures.

        Given that it worked fine in your examples, I think it is likely that I'm making a basic mistake or didn't convey something important about my data in my original post.
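One possible explanation, offered as a guess: if each tag sits on its own line, then `.` (which does not match a newline by default) confines `(.*)` to a single line, and the scalar-context `if` keeps only the first `##...##` pair. The sketch below collects every pair in list context instead; the `$record` data is a hypothetical one-tag-per-line sample, not the real files.

```perl
use strict;
use warnings;

# Hypothetical record, assuming one tag per line.
my $record = join "\n",
    '<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time>',
    'Irrelevant text that may feature the word soft.',
    '##*whispers softly* don\'t##',
    '##very soft##',
    '##the softest even##';

my %counts;
my ($date) = $record =~ /datetime=(\d{4}-\d{2}-\d{2})T/;

# List context plus /g returns the contents of every ##...## pair;
# the non-greedy .*? keeps each capture inside a single pair.
my @tags = $record =~ /##(.*?)##/g;

for my $tag (@tags) {
    my $matches = () = $tag =~ /\bsoft/gi;
    $counts{$date} += $matches;
}

print "$date = $counts{$date}\n";   # prints "2017-09-04 = 3"
```

This pairs the `##` markers strictly in order of appearance, so it would need adjusting if a record can contain an unmatched `##`.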