Hi Monks,

I've been using the following script to count how many times per day each word appears in two separate sections of text within my corpus (which is stored in a hash).

sub getCorpus {
    my %corpus;
    my $text;
    opendir (DR, "$_[0]") || die ("Cannot open directory");
    my @files = readdir(DR);
    for my $i (0 .. $#files) {
        if ($files[$i] =~ /\.txt/ && $files[$i] !~ /\._/) {
            {
                local $/ = undef;
                open(FILE, "$_[0]/$files[$i]") or die ("file not found");
                $text = <FILE>;
            }
            $files[$i] =~ s{\.txt}{};
            $corpus{$files[$i]} = $text;
        }
    }
    return %corpus;    # Returns a hash called corpus
}

my %mycorpus = (
    a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep sheep#",

    b => "date:#20180101# title:#cow puppy# text:#pig pig pig#",
);

my %counts;
my %word_types;       # occurrences of each word across all dates
my %overallcounts;    # total number of words per date
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $dataset = '';

    # get date
    while ($mycorpus{$filename} =~ /date:#(\d{8})#/g) {
        $date = $1;
    }

    # get part 1 of dataset
    while ($mycorpus{$filename} =~ /title:#(.*?)#/g) {
        $dataset = $1;
        # Actions usually performed here which clean the titles
    }

    # get part 2 of dataset
    while ($mycorpus{$filename} =~ /text:#(.*?)#/g) {
        $dataset = $1;
        # Actions usually performed here which clean the text
    }

    my @words = split /\W+/, $dataset;

    foreach my $word (@words) {
        if ($word =~ /(\w+)/gi) {
            $word =~ tr/A-Z/a-z/;
            $counts{$date}{$word}++;
            $word_types{$word}++;
            $overallcounts{$date}++;
        }
    }
}

use Data::Dumper;
print Dumper \%counts;

This script has largely worked without issue and produces the desired output. Recently, though, I have been trying to modify it to use three while loops (instead of two, as above) to populate the scalar $dataset. When I do, the first while loop appears to be ignored and only the second and third populate $dataset. In other words, in the example below I only see data for the "text" and "comment" sections, not the titles.

sub getCorpus {
    my %corpus;
    my $text;
    opendir (DR, "$_[0]") || die ("Cannot open directory");
    my @files = readdir(DR);
    for my $i (0 .. $#files) {
        if ($files[$i] =~ /\.txt/ && $files[$i] !~ /\._/) {
            {
                local $/ = undef;
                open(FILE, "$_[0]/$files[$i]") or die ("file not found");
                $text = <FILE>;
            }
            $files[$i] =~ s{\.txt}{};
            $corpus{$files[$i]} = $text;
        }
    }
    return %corpus;    # Returns a hash called corpus
}

my %mycorpus = (
    a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep sheep#",

    b => "date:#20180101# comment:#woof woof#",

    c => "date:#20180101# title:#cow puppy# text:#pig pig pig#",
);



my %counts;
my %word_types;       # occurrences of each word across all dates
my %overallcounts;    # total number of words per date
foreach my $filename (sort keys %mycorpus) {
    my $date;
    my $dataset = '';

    while ($mycorpus{$filename} =~ /date:#(\d{8})#/g) {
        $date = $1;
    }

    while ($mycorpus{$filename} =~ /title:#(.*?)#/g) {
        $dataset = $1;
        # Actions usually performed here which clean the titles (i.e. substituting certain characters)
    }

    while ($mycorpus{$filename} =~ /text:#(.*?)#/g) {
        $dataset = $1;
        # Actions usually performed here which clean the text
    }

    while ($mycorpus{$filename} =~ /comment:#(.*?)#/g) {
        $dataset = $1;
        # Actions usually performed here which clean the comments
    }

    my @words = split /\W+/, $dataset;

    foreach my $word (@words) {
        if ($word =~ /(\w+)/gi) {
            $word =~ tr/A-Z/a-z/;
            $counts{$date}{$word}++;
            $word_types{$word}++;
            $overallcounts{$date}++;
        }
    }
}

use Data::Dumper;
print Dumper \%counts;

The output I was expecting was:

$VAR1 = {
          '20180101' => {
                          'puppy' => 1,
                          'dog' => 1,
                          'cat' => 1,
                          'cow' => 1,
                          'sheep' => 4,
                          'woof' => 2,
                          'pig' => 3
                        }
        };

But in reality I got:

$VAR1 = {
          '20180101' => {
                          'sheep' => 4,
                          'woof' => 2,
                          'pig' => 3
                        }
        };

Am I doing something wrong here, or is this simply how while loops work? If it is the latter, is there any way to work around it? Thanks!
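For reference, here is a minimal sketch of one possible workaround. It rests on an assumption not confirmed above: that the titles go missing because each loop assigns to $dataset with `=`, so a later section replaces what an earlier one stored, and that every section is meant to contribute words. Under that assumption, appending with `.=` keeps all of them. The `$section` loop variable and the `qw(title text comment)` list are only illustrative, and the cleaning steps are left out.

```perl
use strict;
use warnings;
use Data::Dumper;

# Same sample records as in the second listing above.
my %mycorpus = (
    a => "date:#20180101# title:#cat dog# text:#sheep sheep sheep sheep#",
    b => "date:#20180101# comment:#woof woof#",
    c => "date:#20180101# title:#cow puppy# text:#pig pig pig#",
);

my %counts;
foreach my $filename (sort keys %mycorpus) {
    # A record has one date, so a single list-context match is enough.
    my ($date) = $mycorpus{$filename} =~ /date:#(\d{8})#/;
    next unless defined $date;

    my $dataset = '';
    # Assumption: all sections should be counted, so each capture is
    # appended (.=) rather than assigned (=), which would overwrite
    # whatever an earlier loop had stored.
    foreach my $section (qw(title text comment)) {
        while ($mycorpus{$filename} =~ /\b${section}:#(.*?)#/g) {
            $dataset .= " $1";    # per-section cleaning would go here
        }
    }

    # split can yield an empty leading field, so drop empty strings.
    foreach my $word (grep { length } split /\W+/, $dataset) {
        $counts{$date}{ lc $word }++;
    }
}

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper \%counts;
```

If the sections need different cleaning, the three separate while loops can stay as they are, with only the assignment inside each changed to `.=`.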

In reply to Populating scalar using more than 2 while loops. by Maire
