in reply to Perl threads - parallel run not working

Some time ago, Yary and I exchanged emails for walking directories in parallel. The following is our collaborative effort modified for the OP task.

use strict; use warnings; # usage: ./collect_files.pl <user> /var/tmp /tmp # output is saved under /var/tmp/test_<user>.file.list package Number; sub new { my ($class, $self) = (shift, shift || 0); bless \$self, $class; } sub decr { lock $_[0]; --${ $_[0] } } sub incr { lock $_[0]; ++${ $_[0] } } sub get { lock $_[0]; ${ $_[0] } } package Main; use threads; use threads::shared; use Thread::Queue; use File::Spec::Functions qw(catfile); use MCE; # Get user info and directories to start with my $user = shift; my $uidNum = getpwnam($user); my @start_dirs = @ARGV or die "Please specify one or more starting directories, stopped"; -d or die "No such directory: $_, stopped" for @start_dirs; # Shared queue and counter my $shared_work = Thread::Queue->new( @start_dirs ); my $free_count = shared_clone( Number->new(0) ); # Open output file open my $user_file, ">", "/var/tmp/test_$user.file.list" or die "cannot open file: $!"; sub traverse { $free_count->decr; my ( $dev, $ino, $mode, $nlink, $uid, $gid ); my @work = $_; while ( $_ = shift @work, defined ) { my ( $dir, $path, @paths ) = ( $_ ); opendir DH, $dir or next; for ( readdir DH ) { next if $_ eq '.' || $_ eq '..'; if ( -d ( $path = catfile($dir,$_) ) ) { ( @work < 15 ) ? push @work, $path : $shared_work->enqueue($path); next; } push @paths, $path; } for my $file ( @paths ) { if ( -f $file ) { ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file ) +; MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" ) if $uid == $uidNum; } } } # Done with our work, let everyone know we're free $shared_work->enqueue( (undef) x MCE->max_workers ) if $free_count->incr == MCE->max_workers && !$shared_work->pendi +ng; } my $mce = MCE->new( max_workers => 8, user_begin => sub { $free_count->incr }, user_func => sub { traverse() while ($_ = $shared_work->dequeue, de +fined) }, ); $mce->run; close $user_file;

Kind regards, Mario

Replies are listed 'Best First'.
Re^2: Perl threads - parallel run not working
by marioroy (Prior) on Sep 16, 2015 at 08:57 UTC

    Updated September 5, 2016: Modified code to run with MCE 1.8 / MCE::Shared 1.8.

    The following runs with Perl not built with threads support.

    use strict; use warnings; # usage: ./collect_files.pl <user> /var/tmp /tmp # output is saved under /var/tmp/test_<user>.file.list use File::Spec::Functions qw(catfile); use MCE; use MCE::Queue; use MCE::Shared; # Get user info and directories to start with my $user = shift; my $uidNum = getpwnam($user); my @start_dirs = @ARGV or die "Please specify one or more starting directories, stopped"; -d or die "No such directory: $_, stopped" for @start_dirs; # Shared queue and counter my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs ); my $free_count = MCE::Shared->scalar( 0 ); # Open output file open my $user_file, ">", "/var/tmp/test_$user.file.list" or die "cannot open file: $!"; sub traverse { $free_count->decr; my ( $dev, $ino, $mode, $nlink, $uid, $gid ); my @work = $_; while ( $_ = shift @work, defined ) { my ( $dir, $path, @paths ) = ( $_ ); opendir DH, $dir or next; for ( readdir DH ) { next if $_ eq '.' || $_ eq '..'; if ( -d ( $path = catfile($dir,$_) ) ) { ( @work < 15 ) ? push @work, $path : $shared_work->enqueue($path); next; } push @paths, $path; } for my $file ( @paths ) { if ( -f $file ) { ( $dev, $ino, $mode, $nlink, $uid, $gid ) = lstat( $file ) +; MCE->print( $user_file, "$file:$uidNum:$gid:$mode\n" ) if $uid == $uidNum; } } } # Done with our work, let everyone know we're free $shared_work->enqueue( (undef) x MCE->max_workers ) if $free_count->incr == MCE->max_workers && !$shared_work->pendi +ng; } my $mce = MCE->new( max_workers => 8, user_begin => sub { $free_count->incr }, user_func => sub { traverse() while ($_ = $shared_work->dequeue, de +fined) }, ); $mce->run; close $user_file;

    Kind regards, Mario

Re^2: Perl threads - parallel run not working
by Yary (Pilgrim) on Sep 16, 2015 at 15:21 UTC
    Processing large directory trees in parallel is an interesting problem, as "a directory with huge number of files" could be split up differently than "a huge number of branches" depending on one's approach. The method above should work with either (and is even more general than Gnu "parallel").

    File::Find can be made to work in parallel, but it is ugly. Cleaner to start new as we did here. I keep meaning to blog about this alternate solution (have post half-written) and turn it into a CPAN module...