in reply to Re^2: Counting instances of a string in certain sections of files within a hash
in thread Counting instances of a string in certain sections of files within a hash

$/ = undef;

Can you explain why you are doing this? I don't see the need and it might complicate matters. I also don't immediately see the need for the look-arounds in the regex. Perhaps if you could provide an SSCCE with a small example of the failing data (a couple of lines should suffice) we might be able to suggest something.
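For reference, here is a minimal sketch of what that assignment does, assuming the intent was to read each file in as a single string. The in-memory filehandle and the variable names are made up for illustration and are not taken from your script.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file" so the sketch needs no real file on disk.
my $text = "line one\nline two\nline three\n";

# Default: $/ is "\n", so readline in list context returns one record per line.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close $fh;

# Slurp mode: with $/ undefined, a single readline returns the whole file.
# local restores the previous value of $/ when the block exits.
my $whole;
{
    local $/ = undef;
    open my $in, '<', \$text or die "Cannot open in-memory file: $!";
    $whole = <$in>;
    close $in;
}

printf "line mode:  %d records\n", scalar @lines;
printf "slurp mode: %d characters in one record\n", length $whole;
```

A bare `$/ = undef;` at file level changes every later read in the program, which is why the scoped `local` form is usually the safer choice when slurping really is wanted.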


Re^4: Counting instances of a string in certain sections of files within a hash
by Maire (Scribe) on Nov 02, 2017 at 16:03 UTC

    Thank you, I really appreciate your help with this!

    I was making a mistake with the code that I demonstrated in UPDATE 2 above. The second part of the script was (I think) attempting to print out the regex matches before all of the data had been "dumped" out of the hash and into alldata.txt. When I ran the second part of the code separately from the first, it successfully matched all of the data it was supposed to, demonstrating (I think) that the regex is not the problem here, either. Sorry for wasting your time with that: I should have double-checked that my code was right before posting!

    I am, however, still having trouble getting the main code to count the instances of "soft" per day. I'm using the corrections that you very kindly made to my original script. The only things that I've changed are that I've substituted my own data for your examples and I've also taken the lookarounds out of the regex, following your and kcott's advice:

    use strict;
    use warnings;
    use Test::More tests => 2;

    my %mycorpus = (
        a => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time>
    Irrelevant text that may feature the word soft, softest, or softly.
    ar##*whispers softly* don\'t##
    ##very soft##
    ##the softest even##
    — 164 notes",
        b => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time>
    Irrelevant text that may feature the word soft, softest, or softly.
    4r##skam##
    rr##isak valtersen##
    rr##even bech næsheim##
    dr##god##
    r##they're so soft##
    sr##my heart is bursting##
    ##This is the softest##
    — 379 notes
    Irrelevant text that may feature the word soft, softest, or softly.",
        c => "<p><time datetime=2017-09-04T05:27:03Z>04/09/17 06:27:03</time>
    ##SKSNSKXBXKXND##
    r##I LOVE THESE##
    ##such soft boyfriend™##
    ##you're my sunshine##
    — 180 notes
    Irrelevant text that may feature the word soft, softest, or softly."
    );

    my %counts;
    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $hashtags = '';
        if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
            $date = $1;
        }
        if ($mycorpus{$filename} =~ /[#][#](.*)[#][#]/g) {
            $hashtags = $1;
        }
        if (my $matches =()= $hashtags =~ /\bsoft/gi) {
            $counts{$date} += $matches;
        }
    }

    is ($counts{'2017-09-04'}, 4, "2017-09-04 tally correct");
    is ($counts{'2017-09-30'}, 2, "2017-09-30 tally correct");

    This script produces the following output:

    1..2
    not ok 1 - 2017-09-03 tally correct
    #   Failed test '2017-09-03 tally correct'
    #   at C:\Users\li\test18.pl line 52.
    #          got: undef
    #     expected: '4'
    not ok 2 - 2017-09-04 tally correct
    #   Failed test '2017-09-04 tally correct'
    #   at C:\Users\li\test18.pl line 53.
    #          got: '1'
    #     expected: '2'
    # Looks like you failed 2 tests of 2.

    If it makes any difference, I think that it is the very first instance of "soft" (in the line "ar##*whispers softly* don\'t##") that it actually captures.

    Given that it worked fine in your examples, I think it is likely that I'm making a basic mistake or didn't convey something important about my data in my original post.

      The output I get from running this latest code is different from yours. I don't think your output is right at all - look at the dates. Here's what I get:

      1..2
      not ok 1 - 2017-09-04 tally correct
      #   Failed test '2017-09-04 tally correct'
      #   at 1202625.pl line 53.
      #          got: '1'
      #     expected: '4'
      not ok 2 - 2017-09-30 tally correct
      #   Failed test '2017-09-30 tally correct'
      #   at 1202625.pl line 54.
      #          got: undef
      #     expected: '2'
      # Looks like you failed 2 tests of 2.

      Given that it worked fine in your examples, I think it is likely that I'm making a basic mistake or didn't convey something important about my data in my original post.

      It's the latter. In this new data set you have multiple instances of the double-hash-delimited strings in each hash value. Your code is only checking for the first such one in each value, hence the numbers I see in my output here.
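To illustrate the difference, here is a small sketch with made-up data. The `$value` string and the array names are hypothetical, not from your script, and it assumes one delimited string per line.

```perl
use strict;
use warnings;

# Hypothetical single hash value holding three double-hash-delimited strings.
my $value = "##whispers softly##\n##very soft##\n##the softest even##\n";

# A match tested by if() runs once, so only the first string is captured.
my @seen_by_if;
if ($value =~ /##(.*)##/g) {
    push @seen_by_if, $1;
}

# A scalar-context /g match leaves a position behind; rewind before rescanning.
pos($value) = 0;

# The same pattern in list context returns every capture in the value.
my @all_tags = $value =~ /##(.*)##/g;

print "if() saw:         @seen_by_if\n";
print "list context saw: ", scalar(@all_tags), " strings\n";
```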

      TIMTOWTDI when it comes to fixing this, but the simplest approach is a loop. It will work fine with your existing regular expressions, but I've also cleaned those up to illustrate how they can be simplified.

      use strict;
      use warnings;
      use Test::More tests => 2;

      my %mycorpus = (
          a => "<p><time datetime=2017-09-04T05:23:39Z>04/09/17 06:23:39</time>
      Irrelevant text that may feature the word soft, softest, or softly.
      ar##*whispers softly* don\'t##
      ##very soft##
      ##the softest even##
      — 164 notes",
          b => "p><time datetime=2017-09-30T18:20:56Z>30/09/17 19:20:56</time>
      Irrelevant text that may feature the word soft, softest, or softly.
      4r##skam##
      rr##isak valtersen##
      rr##even bech næsheim##
      dr##god##
      r##they're so soft##
      sr##my heart is bursting##
      ##This is the softest##
      — 379 notes
      Irrelevant text that may feature the word soft, softest, or softly.",
          c => "<p><time datetime=2017-09-04T05:27:03Z>04/09/17 06:27:03</time>
      ##SKSNSKXBXKXND##
      r##I LOVE THESE##
      ##such soft boyfriend™##
      ##you're my sunshine##
      — 180 notes
      Irrelevant text that may feature the word soft, softest, or softly."
      );

      my %counts;
      foreach my $filename (sort keys %mycorpus) {
          my $date;
          my $hashtags = '';
          # Take the date from the datetime attribute (the digits before the "T").
          if ($mycorpus{$filename} =~ /(\d{4}-\d{2}-\d{2})T/g) {
              $date = $1;
          }
          # Visit every ##...## string in the value, not just the first one.
          while ($mycorpus{$filename} =~ /##(.*)##/g) {
              $hashtags = $1;
              if (my $matches =()= $hashtags =~ /\bsoft/gi) {
                  $counts{$date} += $matches;
              }
          }
      }

      is ($counts{'2017-09-04'}, 4, "2017-09-04 tally correct");
      is ($counts{'2017-09-30'}, 2, "2017-09-30 tally correct");
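One caveat, offered as a sketch rather than a required change: the greedy `(.*)` relies on each delimited string sitting on its own line. If two of them could ever share a line (an assumption, as nothing in the thread says they do), a non-greedy `(.*?)` keeps them separate. The `$line` data and the `count_soft` helper below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical value: two delimited strings on ONE line with filler between them.
my $line = "##very soft## some soft filler ##the softest even##";

# Greedy: a single capture running from the first ## to the last ##.
my @greedy = $line =~ /##(.*)##/g;

# Non-greedy: one capture per delimited string.
my @lazy = $line =~ /##(.*?)##/g;

# Hypothetical helper: count words starting with "soft" across the given strings.
sub count_soft {
    my $total = 0;
    for my $tag (@_) {
        my $n = () = $tag =~ /\bsoft/gi;
        $total += $n;
    }
    return $total;
}

printf "greedy: %d capture(s), %d hit(s)\n", scalar @greedy, count_soft(@greedy);
printf "lazy:   %d capture(s), %d hit(s)\n", scalar @lazy,   count_soft(@lazy);
```

With the greedy pattern the filler text between the two strings is swept into the capture, so its "soft" is counted too. The non-greedy pattern counts only what lies inside each pair of delimiters.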

        Ah, brilliant! Thank you so much for all of your help and patience here: I have learned so much (not least to check that I am providing the correct data before I ask people to help me!).