
Hi Monks,

I'm trying to write a script which returns the frequency of a given word (per day) in certain extracts of a very large number of files from a single folder.

So, I have text files which begin with a timestamp and contain a lot of irrelevant metadata, but all of the files have text that I'm interested in enclosed between two hashtags.

For example:

<time datetime=2017-09-03T23:17:53Z>

Irrelevant text.
Irrelevent text.
More text.
##This is the data that I am interested in.##
More text.
More text.

I have a script which returns the frequency of a given word per day across all of the text files in a given folder.

The script iterates over a hash (compiled using a function which is not reproduced here because it is not publicly available). For each file, I first extract the datestamp (which then serves as the key in a new hash), and then I increment the count for that date each time the script finds the word soft (or softest, softer, etc.) being used. Finally, I print each datestamp with the frequency of soft etc. for that day:

my %mycorpus = getCorpus('C:\Users\li\test4');

my %counts;

foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;

    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }

    if ($mycorpus{$filename} =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}

foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

The OUTPUT is:

2017-08-31 = 42
2017-09-01 = 25
2017-09-02 = 40
2017-09-03 = 34
2017-09-04 = 26
.....

This script works as expected and the values/counts that are printed are accurate (i.e. "soft" WAS used 42 times on 31/08/2017). However, this script returns the frequency of "soft" in all of the text, and I'm only interested in how many times it occurs within the "hashtagged" sections. I have attempted to return only the frequency of the word in the hashtagged section using the following script:

my %mycorpus = getCorpus('C:\Users\li\test4');

my %counts;

foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $tags;
    my $comments;
    my $word;

    if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
        $date = $1;
    }

    if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
        $hashtags = $1;
    }

    if ($hashtags =~ /\bsoft/gi) {
        $counts{$date}++;
    }
}

foreach my $date (sort keys %counts) {
    print "$date = $counts{$date}\n";
}

After a bit of messing about, I have managed to get it to produce output, but the count values are a lot lower than they should be:

2017-09-01 = 1
2017-09-02 = 3
2017-09-03 = 6
2017-09-04 = 2
2017-09-05 = 1
2017-09-06 = 3
2017-09-07 = 3

For instance, on 2017-09-04, there are at least ten instances of "soft" in the text matched by the regex, but only "2" have been counted.

I've rewritten this script dozens of times, but I can't seem to spot my mistake(s). Any guidance would be very much appreciated.
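
One possible reading of the symptom, offered as a sketch rather than a definitive diagnosis: an if (... =~ /.../g) test runs the match in scalar context, so it can add at most one to the count per file, however many times "soft" appears. In addition, (.*) is greedy and, without the /s modifier, cannot cross a newline, and $hashtags is not declared inside the loop, so a file with no matching section would reuse the value from the previous file. The example below counts every match inside every ##...## section. The %mycorpus contents are made-up sample data standing in for whatever getCorpus() returns (assumed here to be filename => file contents), and count_in_sections is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash returned by getCorpus():
# filename => full file contents. The data is invented for illustration.
my %mycorpus = (
    'a.txt' => "<time datetime=2017-09-04T10:00:00Z>\n"
             . "soft metadata\n"
             . "##Soft, softer, softest.##\n"
             . "More soft text.\n",
    'b.txt' => "<time datetime=2017-09-04T12:30:00Z>\n"
             . "##Nothing relevant here.##\n",
    'c.txt' => "<time datetime=2017-09-05T08:15:00Z>\n"
             . "##A soft start\nand a soft end.##\n",
);

# Hypothetical helper: count matches of $word_re per date, looking
# only inside the ##...## sections of each file.
sub count_in_sections {
    my ($corpus, $word_re) = @_;
    my %counts;

    for my $filename (sort keys %$corpus) {
        my $text = $corpus->{$filename};

        # Skip files that have no recognisable datestamp.
        my ($date) = $text =~ /datetime=(\d{4}-\d{2}-\d{2})T/ or next;

        # Non-greedy (.*?) stops at the nearest closing ##,
        # /s lets a section span several lines,
        # /g in list context returns every section in the file.
        for my $section ($text =~ /##(.*?)##/gs) {
            # A /g match in list context returns all matches;
            # assigning through an empty list yields their count.
            my $n = () = $section =~ /$word_re/g;
            $counts{$date} += $n;
        }
    }
    return %counts;
}

my %counts = count_in_sections(\%mycorpus, qr/\bsoft/i);
print "$_ = $counts{$_}\n" for sort keys %counts;
```

With this sample data the script prints "2017-09-04 = 3" and "2017-09-05 = 2": the occurrences of "soft" outside the hashtagged sections in a.txt are ignored, and the two-line section in c.txt is still counted. Whether this matches the real corpus depends on how the sections are actually laid out in the files.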

In reply to Counting instances of a string in certain sections of files within a hash by Maire
