Hi Hippo,

This is incredibly helpful, thank you! I've tried substituting your data with a few extracts of my own in the 'in-script' example, and it worked brilliantly, passed the tests, and printed out exactly what I expected when I added that function to the script.

Unfortunately, the problems arise when I try to use the script with the hash created by my getCorpus function, as illustrated by the output here:

    1..2
    2017-09-04 = 1
    not ok 1 - 2017-09-04 tally correct
    #   Failed test '2017-09-04 tally correct'
    #   at C:\Users\li\test9.pl line 67.
    #          got: '1'
    #     expected: '3'
    not ok 2 - 2017-09-30 tally correct
    #   Failed test '2017-09-30 tally correct'
    #   at C:\Users\li\test9.pl line 68.
    #          got: undef
    #     expected: '2'
    # Looks like you failed 2 tests of 2.

Given that it didn't count either of the two instances of the word in my 2017-09-30 documents, my problems may run deeper than I first believed. I think I need to go back and check that the function is actually getting everything that I expect it to from the given folders!

Thanks again for your help though and for illustrating the use of the "Test" module: I think that might come in useful!

UPDATE: It appears that the function is actually working okay after all! I've used Data::Dumper to print out the entire contents of the hash discussed above, and it prints out everything that I expected it to, suggesting that the problem lies elsewhere. I'm going to double-check the regex next and make sure that it always captures what I expect it to.

UPDATE2: Okay, it looks like the problem may be something to do with the regex. I put aside trying to count the words and instead just tried to print out the content in the hash that matched my regex. I ran the following script (I know that this is probably a horribly complex way to do this, but the script was already in my index catalogue!):

    use Data::Dumper;
    open (OUT1, ">alldata.txt") or die;
    %mycorpus = getCorpus('C:\Users\li\test11');
    my $href = \%mycorpus;    # reference to the hash %mycorpus
    print OUT1 Dumper $href;  # note the order will differ from that above
    print OUT1 "\n";
    open (OUT2, ">editeddata.txt") or die;
    open( FILE, "alldata.txt" ) || die "couldn't open\n";
    $/ = undef;
    while (<FILE>) {
        while (/(?<=##)(.*)(?=##)/g) {
            print OUT2 "$1\n";
        }
    }

So the first part of the code prints out the entire contents of the hash into a text file and the second part opens that file and prints out (into a second file) only the lines that match the regular expression. In total, 35 lines should have been printed into the second file, but instead, only 22 were.
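One thing that may be worth ruling out (this is a guess, not something confirmed by the output above): in the script, OUT1 is never closed before alldata.txt is reopened for reading, so the tail of the buffered output might not be on disk yet when the second half runs. The sketch below shows the write, close, then read order on made-up data; the file name alldata_demo.txt and the 35 "##entry N##" lines are hypothetical stand-ins for the real Dumper output.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Dumper output: 35 lines shaped like "##text##".
my @lines = map { "##entry $_##" } 1 .. 35;

my $dump_file = 'alldata_demo.txt';    # hypothetical file name

open( my $out, '>', $dump_file ) or die "Cannot write $dump_file: $!";
print {$out} "$_\n" for @lines;

# Closing the handle flushes Perl's output buffer, so the whole file
# is on disk before it is opened again for reading.
close($out) or die "Cannot close $dump_file: $!";

open( my $in, '<', $dump_file ) or die "Cannot read $dump_file: $!";
my $content = do { local $/; <$in> };    # slurp the whole file
close($in);

# In list context, /g returns every capture; '.' does not cross newlines,
# so each marked line contributes one match.
my @matches = $content =~ /(?<=##)(.*)(?=##)/g;
print scalar(@matches), " matches\n";

unlink $dump_file;
```

If the count in the real script changes once OUT1 is closed before the second open, buffering was at least part of the problem; if it stays at 22, the cause is somewhere else.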

There is nothing that distinguishes the missed lines from the captured ones: they are all formatted in exactly the same way and all are captured when the regex is run on e.g. https://regex101.com/. Moreover, if I run a very simple programme below that prints out the regex matches from a single file, the "missed" lines are captured:

    open(FILE, 'C:\Users\li\test11\164949.txt');
    while (<FILE>) {
        if ( /(?<=##)(.*)(?=##)/g ) {
            print "$1\n";
        }
    }

In other words, the lines that are not being captured do match the regex and should be being printed. It looks to me like this is probably the root of my problems, and, if I fix this, then it should start counting the frequency of my words properly!
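Another way to narrow this down would be to skip the intermediate text file and run the regex directly against the values in the hash. The sketch below assumes the hash maps each file name to that file's full text; that shape is an assumption for illustration (the real structure returned by getCorpus may differ), and the file names and contents are invented.

```perl
use strict;
use warnings;

# Hypothetical corpus: file name => file text. The structure that
# getCorpus really returns may differ; this shape is assumed.
my %corpus = (
    '164949.txt' => "title\n##first marked line##\nbody text\n",
    '164950.txt' => "##second marked line##\n##third marked line##\n",
);

my @found;
for my $file ( sort keys %corpus ) {
    # List-context /g returns every capture from this file's text.
    push @found, $corpus{$file} =~ /(?<=##)(.*)(?=##)/g;
}

print "$_\n" for @found;
```

If every expected line shows up this way, the regex and the hash are both fine and the loss is happening in the write-then-read step.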


In reply to Re^2: Counting instances of a string in certain sections of files within a hash by Maire
in thread Counting instances of a string in certain sections of files within a hash by Maire
