creating unknown number of threads and then join results

rudds_perl_habit has asked for the wisdom of the Perl Monks concerning the following question:

I have some successful scripts that use threads, but they all create a fixed number of threads. Now I am trying to write a script that creates X threads, where X is determined by the number of search directories. In this particular case, each thread runs a "cleartool find" command on a directory to get an array of results back. For this example I am just using the Unix find command. The "cleartool find" command in ClearCase is similar but takes a lot longer to run.

What I am finding is that on small data it seems to work fine and I get consistent results. But on those really long-running ClearCase commands, I don't always get all the data I expect in the @Final array. There is probably a way to do this better... maybe locking the variable before I update it? I was thinking that each thread is updating a different hash key of the variable, so it should be safe to update this way? Or does it need to be locked before each join statement? Any suggestions on how to do this better?

#!/usr/local/bin/perl

use Cwd;
use threads;
use Data::Dumper;

my $use_cc = 0;
my @dirs = ();
if ( $use_cc ) {
  @dirs = split(/\s+/, $ENV{CLEARCASE_AVOBS});
} else {
  @dirs = qw(/bin /sbin /usr/local/bin /usr/sfw/bin /usr/bin);
}

# it's a clearcase thing
my $branch = "v4.0.0_gxp_patch";

# hash of dir names with thread values
my %threads = ();

# hash of dir names with arrays of found items
my %Found = ();

# large array to hold all results
my @Final = ();

foreach my $dir ( sort @dirs ) {
  chomp($dir);
  # add dir name to hash
  $Found{$dir} = ();
  # create thread and add it to threads hash
  $threads{$dir} = threads->create({'context' => 'list'}, 'find_thread', $dir, $use_cc, $branch);
}
foreach my $dir ( sort keys %threads ) {
  # cycle through threads hash and join up results, put them in hash-of-arrays
  @{ $Found{$dir} } = $threads{$dir}->join();
}
# collect all the smaller hash-of-arrays into a large array for easier processing later on
foreach my $dir ( sort keys %Found ) {
  foreach my $item ( sort @{ $Found{$dir} } ) {
    push(@Final, $item);
  }
}
print Dumper(@Final);
print "SIZE: " . scalar(@Final) . "\n";

sub find_thread {
  my $dir = shift;
  my $cc_flag = shift;
  my $branch = shift;
  my @results;
  chdir $dir or die "Cannot change to $dir\n";
  print "Finding all files in dir: $dir\n";
  if ( $cc_flag ) {
    @results = `cleartool find -all -version 'brtype($branch)' -print 2>&1`;
  } else {
    @results = `find $dir -print 2>&1`;
  }
  return @results;
}


Re: creating unknown number of threads and then join results by BrowserUk (Patriarch) on Jul 29, 2013 at 23:42 UTC
"maybe locking the variable before I update it? I was thinking that each thread is updating a different hash key of the variable, so it should be safe to update this way? Or does it need to be locked before each join statement?"

All your updates to `%Found` are done within the same thread, so there is no need to lock anything. Besides which, `%Found` isn't a shared variable, so you couldn't lock it if you tried.

"Any suggestions on how to do this better?"

Not really, apart from this: `foreach my $dir ( sort keys %Found ) { foreach my $item ( sort @{ $Found{$dir} } ) { push(@Final, $item); } }` which could be more efficiently written as: `foreach my $dir ( sort keys %Found ) { push(@Final, sort @{ $Found{$dir} }); }`

It is hard to see any scope for you not getting all the results produced by the external commands. Perhaps you could print out the size of `@results` before returning and then sum those and compare it with the size of `@Final`?

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
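The size check suggested in this reply could be sketched as follows. This is only one way to do it, and it is a variation on the suggestion: instead of printing the size inside the worker, the count is returned in front of the data so it can be compared after the join. The helpers `with_count` and `check_count` are hypothetical names, not part of the original script, and the loop uses made-up sample data in place of real thread results.

```perl
use strict;
use warnings;

# Worker side (hypothetical helper): prepend the number of results, so the
# count travels through join() together with the data.
sub with_count {
    my @results = @_;
    return (scalar(@results), @results);
}

# Joining side (hypothetical helper): strip the count and warn if it does
# not match the number of items that actually arrived.
sub check_count {
    my ($label, $claimed, @items) = @_;
    my $got = scalar @items;
    warn "$label: worker reported $claimed items, received $got\n"
        if $claimed != $got;
    return @items;
}

# Made-up stand-in for joined thread results. In the script above the
# calls would be, roughly:
#   return with_count(@results);                    # at the end of find_thread
#   check_count($dir, $threads{$dir}->join());      # in the join loop
my @Final;
my $total = 0;
for my $dir (qw(/bin /sbin)) {
    my @items = check_count($dir, with_count("$dir/a\n", "$dir/b\n"));
    $total += @items;            # running sum of per-directory sizes
    push @Final, sort @items;
}
print "SIZE: ", scalar(@Final), " (expected $total)\n";
```

If the two numbers always agree, the data is not being lost between the worker and `@Final`, which points back at the external command.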
Re^2: creating unknown number of threads and then join results by rudds_perl_habit (Novice) on Jul 30, 2013 at 16:36 UTC
Thanks for the suggestion about dumping the `@results` array. That did help. Well, sort of. It made me more confused, actually.

In the find_thread routine, I added a line before the return: `print "find_thread dump $dir: " . Dumper(@results) . "\n";`

I then ran my script 10 times, dumping the results to 10 files. What I am finding is that the dump of `@results` can sometimes contain output that is from another directory entirely. For example:

find_thread dump /vobs/doc: $VAR1 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/0';
$VAR2 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/1';
$VAR3 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/2';
$VAR4 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/3';
SIZE /vobs/doc: 4
find_thread dump /vobs/drs: $VAR1 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/0';
$VAR2 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/1';
$VAR3 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/2';
$VAR4 = '/vobs/license_manager/install/LMInstall/LM_Install.iap_xml@@/main/v4.0.0_gxp_patch/3';
SIZE /vobs/drs: 4

Which is totally confusing. `@results` is local to `find_thread` and shouldn't know anything about the other threads. What is even weirder is that when I switch to not using the ClearCase find, and just find directories on the system, it all seems to work fine.

So at this point, I am thinking that spawning multiple ClearCase find commands at once is causing an issue. I'll take it up with IBM.
Re^3: creating unknown number of threads and then join results by BrowserUk (Patriarch) on Jul 30, 2013 at 18:32 UTC
"So at this point, I am thinking that spawning multiple ClearCase find commands at once is causing an issue. I'll take it up with IBM."

I concur. There is nothing in your code that could account for the symptoms you are seeing, so their source can only lie with the commands you are calling.
Re^4: creating unknown number of threads and then join results by rudds_perl_habit (Novice) on Jul 30, 2013 at 18:43 UTC
Re^5: creating unknown number of threads and then join results by BrowserUk (Patriarch) on Jul 30, 2013 at 18:54 UTC
Re^5: creating unknown number of threads and then join results (ex::threads::safecwd) by Anonymous Monk on Jul 31, 2013 at 00:22 UTC
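
The title of the last reply mentions ex::threads::safecwd, which hints at a possible cause that the replies shown above do not discuss. According to the threads documentation, on platforms other than MSWin32 the current working directory is shared among all threads, so a `chdir` in one thread changes it for every thread. If `cleartool find -all` picks its search scope from the working directory (an assumption here, not something confirmed in this thread), then racing `chdir` calls in `find_thread` could produce results from another directory, while `find $dir` would be unaffected because it is given the path explicitly. A minimal Unix-only sketch of one way to avoid the shared `chdir` is to change directory in a forked child only. The helper name `run_in_dir` is hypothetical, and `ls` stands in for the real command.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# Hypothetical helper: run @cmd with $dir as its working directory and
# return the output lines. The chdir happens only in the forked child,
# so the calling process keeps its own working directory.
sub run_in_dir {
    my ($dir, @cmd) = @_;
    my $pid = open(my $out, '-|');      # forking open: child's STDOUT feeds $out
    die "Cannot fork: $!\n" unless defined $pid;
    if ($pid == 0) {
        # Child process only.
        chdir $dir or die "Cannot change to $dir: $!\n";
        exec(@cmd) or die "Cannot exec $cmd[0]: $!\n";
    }
    my @lines = <$out>;                 # parent reads everything the child prints
    close $out;
    return @lines;
}

my $before = getcwd();
my @found  = run_in_dir('/', 'ls');     # 'ls' is only a stand-in command
my $after  = getcwd();
print "cwd unchanged: ", ($before eq $after ? "yes" : "no"), "\n";
```

Inside `find_thread`, the `chdir` plus backticks could then be replaced by a call such as `run_in_dir($dir, 'cleartool', 'find', ...)`, with the arguments from the original command filled in. Whether this removes the mixed-up results would still need to be checked against ClearCase itself.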