Maire has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I've been using the following script to count the number of times per day words appear in two separate sections of text within my corpus (which has been stored in a hash).

sub getCorpus { my %corpus; my $text; opendir (DR, "$_[0]") || die ("Cannot open directory"); my @files = readdir(DR); for my $i (0 .. $#files) { if ($files[$i] =~ /\.txt/ && $files[$i] !~ /\._/) { { local $/ = undef ; open(FILE, "$_[0]/$files[$i]") or die ("file not found"); $text = <FILE> ; } $files[$i] =~ s{\.txt}{}; $corpus{$files[$i]} = $text; } } return %corpus; #Returns a hash called corpus } my %mycorpus = ( a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep shee +p#" , b => "date:#20180101# title:#cow puppy# text:#pig pig pig#", ); my %counts; foreach my $filename (sort keys %mycorpus) { my $date; my $dataset = ''; #get date while ($mycorpus{$filename} =~ /date:#(\d{8})#/g){ $date = $1; } #get part 1 of dataset while ($mycorpus{$filename} =~ /title:#(.*?)#/g){ $dataset = $1; #Actions usually performed here which clean the ti +tles } #get part 2 of dataset + while ($mycorpus{$filename} =~ /text:#(.*?)#/g){ $dataset = $1; #Actions usually performed here which clean the text } my @words = split /\W+/, $dataset; foreach my $word (@words){ if ($word =~ /(\w+)/gi){ $word =~ tr/A-Z/a-z/; $counts{$date}{$word}++; $word_types{$word}++; $overallcounts{$date}++; } } } use Data::Dumper; print Dumper \%counts;

This script has largely worked without issue and produces the desired output. However, I have recently been trying to modify the script so that I am using three while loops (instead of two, as above) to populate the scalar $dataset, but in these instances, the first while loop appears to be ignored and only the 2nd and 3rd populate $dataset. In other words, in the example below, I am only seeing data for the "text" and "comments" but not the titles.

sub getCorpus { my %corpus; my $text; opendir (DR, "$_[0]") || die ("Cannot open directory"); my @files = readdir(DR); for my $i (0 .. $#files) { if ($files[$i] =~ /\.txt/ && $files[$i] !~ /\._/) { { local $/ = undef ; open(FILE, "$_[0]/$files[$i]") or die ("file not found"); $text = <FILE> ; } $files[$i] =~ s{\.txt}{}; $corpus{$files[$i]} = $text; } } return %corpus; #Returns a hash called corpus } my %mycorpus = ( a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep shee +p#" , b => "date:#20180101# comment:#woof woof#", c => "date:#20180101# title:#cow puppy# text:#pig pig pig#", ); my %counts; foreach my $filename (sort keys %mycorpus) { my $date; my $dataset = ''; while ($mycorpus{$filename} =~ /date:#(\d{8})#/g){ $date = $1; } while ($mycorpus{$filename} =~ /title:#(.*?)#/g){ $dataset = $1; #Actions usually performed here which clean the ti +tles (i.e. substituting certain characters) } while ($mycorpus{$filename} =~ /text:#(.*?)#/g){ $dataset = $1; #Actions usually performed here which clean the text } + while ($mycorpus{$filename} =~ /comment:#(.*?)#/g){ $dataset = $1; #Actions usually performed here which clean the co +mments } my @words = split /\W+/, $dataset; foreach my $word (@words){ if ($word =~ /(\w+)/gi){ $word =~ tr/A-Z/a-z/; $counts{$date}{$word}++; $word_types{$word}++; $overallcounts{$date}++; } } } use Data::Dumper; print Dumper \%counts;

The output I was expecting was:

$VAR1 = { '20180101' => { 'puppy' => 1 'dog' => 1 'cat' => 1 'cow' => 1 'sheep' => 4, 'woof' => 2, 'pig' => 3 } };

But in reality I got:

$VAR1 = { '20180101' => { 'sheep' => 4, 'woof' => 2, 'pig' => 3 } };

Does it seem like I am doing something wrong here, or is this a recognised way that while loops work? In the instance of the latter, is there any way to work around this? Thanks!

Replies are listed 'Best First'.
Re: Populating scalar using more than 2 while loops.
by haukex (Archbishop) on Aug 25, 2018 at 09:59 UTC

    In your inner while loops, $dataset = $1; is overwriting the contents of $dataset on every match. Personally, instead of appending the matches to a string, I would have split the matches directly inside the inner while, and then pushed the individual words onto an array or stored them in a hash directly. But in this code, one quick fix is to do $dataset .= " $1 "; inside the loop instead, so that the matches are appended to the string rather than replacing it. Also, note that in your second example, three of the inner while loops can be combined:

    while ( $mycorpus{$filename} =~ /(?:title|text|comment):#(.*?)#/g ) { $dataset .= " $1 "; }

    It's good you're using Data::Dumper, I would recommend to insert more print statements to debug issues like this - one in every loop of your code would give a hint as to what's going on.

    (BTW, in general, please omit code that isn't relevant to the question, in this case that's both sub getCorpus definitions, and run your code through perltidy.)

      Ah, thank you very much for this. That does seem a much cleaner way of doing it: I have adopted my script accordingly. Also, thank you for the "posting" tips! I tend to err on the side of caution and give way TMI, but I take your point and will try to clean-up before posting in future. Thanks again!
Re: Populating scalar using more than 2 while loops.
by poj (Abbot) on Aug 25, 2018 at 10:43 UTC

    If each cleanup action is different, consider putting that code in a subroutine to simplify the while loop. For example

    #!perl use strict; my %mycorpus = ( a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep sheep# +" , b => "date:#20180101# comment:#woof woof#", c => "date:#20180101# title:#cow puppy# text:#pig pig pig#", ); my %counts; foreach my $filename (sort keys %mycorpus) { my $date; if ($mycorpus{$filename} =~ /date:#(\d{8})#/g){ $date = $1; } else { die "No date found filename : $filename => $mycorpus{$filename}"; } while ($mycorpus{$filename} =~ /(title|text|comment):#(.*?)#/g){ my $dataset = cleanup($1,$2); my @words = split /\W+/, $dataset; foreach my $word (@words){ if ($word =~ /(\w+)/gi){ $word =~ tr/A-Z/a-z/; $counts{$date}{$word}++; #$word_types{$word}++; #$overallcounts{$date}++; } } } } sub cleanup { my ($key,$value) = @_; if ($key eq 'title'){ #Actions usually performed here which clean the titles #(i.e. substituting certain characters) } if ($key eq 'text'){ #Actions usually performed here which clean the text } if ($key eq 'comment'){ #Actions usually performed here which clean the comments } return $value; } use Data::Dumper; print Dumper \%counts;
    poj
      This is really helpful, thank you very much!
Re: Populating scalar using more than 2 while loops.
by Marshall (Canon) on Aug 25, 2018 at 20:10 UTC
    As my first comment, you should have
    use strict; use warnings;
    at the start of your scripts.
    Using those directives, the errors are:
    Global symbol "%word_types" requires explicit package name (did you fo +rget to declare "my %word_types"?) at C:\Users...\PerlProjects\Monks\ +Node_122108_getCorpus.pl line 69. Global symbol "%overallcounts" requires explicit package name (did you + forget to declare "my %overallcounts"?) at C:\Users....\PerlProjects +\Monks\Node_122108_getCorpus.pl line 70. BEGIN not safe after errors--compilation aborted at C:\Users\mmtho\Doc +uments\PerlProjects\Monks\Node_122108_getCorpus.pl line 75.
      Ah, okay thank you!