Hi Monks,

I'm trying to write a script which returns the frequency of a given word (per day) in certain extracts of a very large number of files from a single folder.

So, I have text files which begin with a timestamp and contain a lot of irrelevant metadata, but all of the files have text that I'm interested in enclosed between two hashtags.

e.g.
<time datetime=2017-09-03T23:17:53Z> Irrelevant text. Irrelevant text. More text. ##This is the data that I am interested in.## More text. More text.
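To make the format concrete, here is a minimal sketch of pulling the date and the hashtagged text out of one such record. The variable names ($text, $date, @sections) are only for illustration, and it assumes a record may contain more than one ##...## section.

```perl
use strict;
use warnings;

# One record, modeled on the sample line above.
my $text = '<time datetime=2017-09-03T23:17:53Z> Irrelevant text. More text. '
         . '##This is the data that I am interested in.## More text. More text.';

# The date is the YYYY-MM-DD part between "datetime=" and "T".
my ($date) = $text =~ /datetime=(\d{4}-\d{2}-\d{2})T/;

# In list context with /g, the match returns every captured section.
# .*? is non-greedy, so each ##...## pair is captured separately;
# /s lets a section run across line breaks.
my @sections = $text =~ /##(.*?)##/gs;

print "$date\n";
print "$_\n" for @sections;
```

For the sample record this prints the date 2017-09-03 followed by the single hashtagged sentence.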

I have a script which, when run, returns the frequency of a given word per day in all of the text files from a given folder:

The script iterates over a hash of file contents (built by a function, getCorpus, which is not reproduced here because it is not publicly available). I first extract the datestamp from each file (these then serve as the keys of a new hash), and then increase the count for that date each time the script finds the word "soft" (or "softest", "softer", etc.) being used. Finally, I print each datestamp and the frequency of "soft" etc. for that day.

my %mycorpus = getCorpus('C:\Users\li\test4');
my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}
foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

The OUTPUT is:
2017-08-31 = 42
2017-09-01 = 25
2017-09-02 = 40
2017-09-03 = 34
2017-09-04 = 26
.....

This script works as expected and the values/counts that are printed are accurate (i.e. "soft" WAS used 42 times on 31/08/2017). However, this script returns the frequency of "soft" in all of the text, and I'm only interested in how many times it occurs within the "hashtagged" sections. I have attempted to return only the frequency of the word in the hashtagged section using the following script:

my %mycorpus = getCorpus('C:\Users\li\test4');
my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;
    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }
    if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
        $hashtags = $1;
    }
    if ($hashtags =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}
foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

After a bit of messing about, I have managed to get it to produce output, but the count values are a lot lower than they should be:

2017-09-01 = 1
2017-09-02 = 3
2017-09-03 = 6
2017-09-04 = 2
2017-09-05 = 1
2017-09-06 = 3
2017-09-07 = 3

For instance, on 2017-09-04, there are at least ten instances of "soft" in the text matched by the regex, but only "2" have been counted.

I've rewritten this script dozens of times, but I can't seem to spot my mistake(s). Any guidance would be very much appreciated.
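One possible cause, offered as a sketch rather than a confirmed diagnosis: a match used as an `if` condition is true at most once per test, so `$counts{$date}++` can fire only once per file no matter how many times "soft" appears. In addition, a greedy `.*` between the ## markers runs from the first ## to the last one on a line, and `.` does not cross newlines without /s. The version below counts every occurrence inside each ##...## section. The %mycorpus contents are made-up stand-ins for whatever getCorpus returns (assumed here to be file name => file text).

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash returned by getCorpus():
# file name => full file contents.
my %mycorpus = (
    'a.txt' => '<time datetime=2017-09-04T10:00:00Z> soft meta '
             . '##Soft and softer.## more ##softest## soft',
    'b.txt' => '<time datetime=2017-09-04T12:00:00Z> meta ##nothing here## soft',
    'c.txt' => '<time datetime=2017-09-05T09:00:00Z> ##a soft landing##',
);

my %counts;
foreach my $filename (sort keys %mycorpus) {
    my $text = $mycorpus{$filename};

    # Skip files that have no recognisable datestamp.
    my ($date) = $text =~ /datetime=(\d{4}-\d{2}-\d{2})T/ or next;

    # Non-greedy .*? keeps each ##...## section separate;
    # /s allows a section to span several lines.
    foreach my $section ($text =~ /##(.*?)##/gs) {
        # With /g in list context the match returns every hit,
        # so the size of @hits is the number of occurrences.
        my @hits = $section =~ /\bsoft/gi;
        $counts{$date} += @hits;
    }
}

print "$_ = $counts{$_}\n" for sort keys %counts;
```

With this sample data, the "soft" outside the ## markers is ignored, and 2017-09-04 gets a count of 3 (two in the first section of a.txt, one in the second) while 2017-09-05 gets 1. Declaring the section variable inside the loop also avoids carrying a previous file's text over to a file that has no hashtagged section.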


In reply to Counting instances of a string in certain sections of files within a hash by Maire
