Hi Monks,
I'm trying to write a script which returns the frequency of a given word (per day) in certain extracts of a very large number of files from a single folder.
So, I have text files which begin with a timestamp and contain a lot of irrelevant metadata, but all of the files have text that I'm interested in enclosed between two hashtags.
e.g. <time datetime=2017-09-03T23:17:53Z> Irrelevant text. Irrelevant text. More text. ##This is the data that I am interested in.## More text. More text.
I have a script which, when run, returns the frequency of a given word per day in all of the text files from a given folder:

    my %mycorpus = getCorpus('C:\Users\li\test4');
    my %counts;
    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $tags;
        my $comments;
        my $word;
        if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
            $date = $1;
        }
        if ($mycorpus{$filename} =~ /\bsoft/gi) {
            $counts{$date}++;
        }
    }
    foreach my $date (sort keys %counts) {
        print "$date = $counts{$date}\n";
    }

This script iterates over a hash (compiled using a function which is not reproduced here because it is not publicly available). I first extract the datestamps (which then serve as the keys for a new hash), and then the count value should be increased each time the script finds the word "soft" (or "softest", "softer", etc.) being used. I then print out the datestamp and the frequency of usage of "soft" etc. per day.

The OUTPUT is:

    2017-08-31 = 42
    2017-09-01 = 25
    2017-09-02 = 40
    2017-09-03 = 34
    2017-09-04 = 26
    .....
This script works as expected and the values/counts that are printed are accurate (i.e. "soft" WAS used 42 times on 31/08/2017). However, this script returns the frequency of "soft" in all of the text, and I'm only interested in how many times it occurs within the "hashtagged" sections. I have attempted to return only the frequency of the word in the hashtagged section using the following script:
    my %mycorpus = getCorpus('C:\Users\li\test4');
    my %counts;
    foreach my $filename (sort keys %mycorpus) {
        my $date;
        my $tags;
        my $comments;
        my $word;
        if ($mycorpus{$filename} =~ /(?<==)(\d{4}-\d{2}-\d{2})(?=T)/g) {
            $date = $1;
        }
        if ($mycorpus{$filename} =~ /(?<=##)(.*)(?=##)/g) {
            $hashtags = $1;
        }
        if ($hashtags =~ /\bsoft/gi) {
            $counts{$date}++;
        }
    }
    foreach my $date (sort keys %counts) {
        print "$date = $counts{$date}\n";
    }
After a bit of messing about, I have managed to get it to produce output, but the count values are a lot lower than they should be:
    2017-09-01 = 1
    2017-09-02 = 3
    2017-09-03 = 6
    2017-09-04 = 2
    2017-09-05 = 1
    2017-09-06 = 3
    2017-09-07 = 3
For instance, on 2017-09-04, there are at least ten instances of "soft" in the text matched by the regex, but only "2" have been counted.
I've rewritten this script dozens of times, but I can't seem to spot my mistake(s). Any guidance would be very much appreciated.
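One direction that may be worth checking, offered as a sketch rather than a confirmed diagnosis: a test of the form `if ($text =~ /\bsoft/gi) { $counts{$date}++ }` can add at most one to the count per file, however many times the word occurs, and a greedy `(.*)` captures from the first `##` to the last `##` on a line without crossing newlines. The example below counts every occurrence inside every `##...##` section. The `%mycorpus` data and the `count_in_hashtags` helper are made up for illustration (the real `getCorpus` is not available), and it assumes the `##` delimiters always come in balanced pairs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash returned by getCorpus():
# filename => full file contents. The sample text is invented.
my %mycorpus = (
    'a.txt' => '<time datetime=2017-09-04T10:00:00Z> soft meta ##Soft and softer.## more soft ##softest## end',
    'b.txt' => '<time datetime=2017-09-04T12:00:00Z> meta ##nothing here## soft',
    'c.txt' => "<time datetime=2017-09-05T08:00:00Z> ##soft\nsoft## x",
);

# Hypothetical helper: counts matches of $word_re per date,
# looking only inside ##...## sections.
sub count_in_hashtags {
    my ($corpus, $word_re) = @_;
    my %counts;
    for my $filename (sort keys %$corpus) {
        my $text = $corpus->{$filename};

        # Take the date from the timestamp; skip files that have none.
        my ($date) = $text =~ /=(\d{4}-\d{2}-\d{2})T/ or next;

        # List context with /g returns every section, not just the first.
        # .*? is non-greedy, and /s lets . match across newlines.
        for my $section ($text =~ /##(.*?)##/gs) {
            # The "= () =" idiom forces list context, so $n is the
            # number of matches rather than a true/false value.
            my $n = () = $section =~ /$word_re/g;
            $counts{$date} += $n;
        }
    }
    return %counts;
}

my %counts = count_in_hashtags(\%mycorpus, qr/\bsoft/i);
print "$_ = $counts{$_}\n" for sort keys %counts;
```

With the sample data above, the occurrences of "soft" outside the `##` sections are ignored, and each occurrence inside a section is counted separately rather than once per file.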