http://qs1969.pair.com?node_id=1170878

exilepanda has asked for the wisdom of the Perl Monks concerning the following question:

A little background first. I am working on Win7 and want to develop a dir tree walker module. File::List won't deal with utf8 file names, and neither will dir /b/s (much faster though), so I am going to build this dir walker module with Win32::Unicode::Dir. Since a single thread goes really slow, I want to use threads as well. Here's my work so far, but things go really weird.
$| = 1;

package Vz::DirTree;

use Data::Dumper;
use strict;
use threads;
use threads::shared;
use Devel::Pointer;

sub _fetchDir {
    my ( $c, $root ) = @_;
    print "$root>$/";
    my $dir;
    eval { opendir $dir, $root or die "$! Can't open '$root'" };
    return if $@;
    foreach my $file ( readdir $dir ) {
        next if $file =~ /^\.{1,2}/;
        $file = "$root/$file";
        # print $file . $/;    # <------ Uncomment if you want to see progress
        push @{$c->{isFile}},     $file if -f $file;
        push @{$c->{isDir}},      $file if -d $file;
        push @{$c->{dirToFetch}}, $file if -d $file;
    }
    closedir $dir;
}

sub _walk {
    my $addr = shift;
    my $obj  = deref( $addr );
    my $root = shift( @{$obj->{dirToFetch}} );
    $obj->_fetchDir( $root );
}

sub setRoot {
    my $c    = shift;
    my $root = shift;
    $root = "./" unless $root;
    my $self = bless {
        dirToFetch => [],
        isDir      => [],
        isFile     => [],
    }, $c;
    $self->_fetchDir( $root );    # first scan from root dir
    my $objAddr : shared = address_of( $self );
    while ( @{$self->{dirToFetch}} ) {
        sleep 1;
        threads->create( '_walk', $objAddr )->join;
    }
    return $self;
}
1;

package main;
use Data::Dumper;
my $dir = Vz::DirTree->setRoot( "M:/" );
print Dumper $dir;
This code is never able to run to the end, and depending on which dir I set as root, different errors pop out. The errors I've seen so far include:

- Scalars leaked: -1 (or some other negative number)
- Free to wrong pool 2278d00 not 7349a8 during global destruction.
- Bizarre SvTYPE $someNum

Any idea why this code goes so wrong??? What result shows up on your box?? Any clue is much appreciated.

UPDATE: This will Work!

package Vz::DirTree;

use Data::Dumper;
use strict;
use threads;
use threads::shared;

my @dirToFetch : shared;
my @isFile     : shared;
my @isDir      : shared;
my @warn       : shared;
my $ModuleBusy;

sub _fetchDir {
    my $root = shift;
    my $dir;
    eval { opendir $dir, $root or die "$! Can't open '$root'" };
    if ( $@ ) {
        push @warn, $root;
        return undef;
    }
    foreach my $file ( readdir $dir ) {
        next if $file =~ /^\.{1,2}/;
        $file = "$root/$file";
        if ( -f $file ) {
            push @isFile, $file;
        }
        elsif ( -d $file ) {
            push @isDir,      $file;
            push @dirToFetch, $file;
        }
    }
    closedir $dir;
}

sub setRoot {
    my $c    = shift;
    my $root = shift;
    $root = "./" unless $root;
    my $waitIfBusy = ( grep { /-WaitIfBusy/ } @_ ) ? 1 : 0;
    if ( $ModuleBusy && !$waitIfBusy ) {
        die __PACKAGE__ . " is busy. Pass -WaitIfBusy as argv if you can wait";
    }
    elsif ( $ModuleBusy && $waitIfBusy ) {
        sleep 1 until !$ModuleBusy;
    }
    $ModuleBusy = 1;

    my $self = bless { isDir => [], isFile => [], duration => undef }, $c;
    _fetchDir( $root );    # first scan from root dir

    my $startTime = time;
    my $walker = threads->create( sub {
        while ( @dirToFetch ) {
            my $t = scalar( threads->list );
            next if $t > @dirToFetch;
            # so that threads will not empty @dirToFetch before
            # other threads finish pushing their results back to @dirToFetch
            threads->create( sub { _fetchDir( shift @dirToFetch ) } );
        }
    } );

    while ( threads->list ) {
        my $t = scalar( threads->list );
        $self->{MaxThreadHit} = $t if $t > $self->{MaxThreadHit};
        $_->join foreach ( threads->list( threads::joinable ) );
    }
    my $endTime = time;
    $self->{duration} = $endTime - $startTime;

    @{$self->{isDir}}    = @isDir;
    @{$self->{isFile}}   = @isFile;
    @{$self->{failOpen}} = @warn;
    @isFile = ();
    @isDir  = ();
    @warn   = ();
    $ModuleBusy = 0;
    return $self;
}
1;

package main;
use Data::Dumper;
my $dir = Vz::DirTree->setRoot( "M:/" );
print Dumper $dir;

Replies are listed 'Best First'.
Re: use threads for dir tree walking really hurts
by Corion (Patriarch) on Aug 31, 2016 at 13:29 UTC
    use Devel::Pointer;
    ...
    my $obj  = deref ( $addr );
    my $root = shift ( @{$obj->{dirToFetch}} );

    Why are you doing that?

    Perl is not C and you don't need to step outside the Perl datatypes to handle data access from multiple threads within Perl.

    The following should be the equivalent of what you do, except far saner and not needing Devel::Pointer:

    ...
    sub _walk {
        my $obj  = shift;
        my $root = shift( @{$obj->{dirToFetch}} );
        $obj->_fetchDir( $root );
    }
    ...
    threads->create( '_walk', $self )->join;

    Note that you do not even start running multiple threads in the above because you spawn a separate thread but don't continue until it has finished its work. Most likely, a better approach is to store all threads and then wait for them to finish:

    ...
    push my @running, threads->create( '_walk', $self );
    ...
    while ( @running ) {
        my $next = shift @running;
        $next->join;
    };

    Personally, I recommend using Thread::Queue and a worker pool to handle a workload because starting a Perl thread is relatively resource intensive. I'm not sure that using multiple threads will bring you much benefit, as I think your operation largely is limited by the network or the HD (or filesystem) performance.

    Thinking more about it, I guess that a somewhat better approach is to have all directories to crawl stored in a Thread::Queue and to have threads fetch from that whenever they need to crawl a new directory. For output, I would use another Thread::Queue, just for simplicity (roughly adapted from here):

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;

    my $directories = Thread::Queue->new();
    my $files       = Thread::Queue->new();

    use vars '$NUM_CPUS';
    $NUM_CPUS ||= 4;

    sub _walk {
        while ( defined my $dir = $directories->dequeue ) {
            my @entries = ...;
            for my $e ( @entries ) {
                if ( -d $e ) {
                    # depth-first search
                    $directories->insert( 0, $e );
                }
                else {
                    # It would be much faster to enqueue all files in bulk instead
                    # of enqueueing them one by one, but first get it working before
                    # you make it fast
                    $files->enqueue( $e );
                };
            };
        };
    }

    $directories->enqueue( @ARGV );
    for ( 1 .. $NUM_CPUS ) {
        threads->new( \&_walk )->detach;
    };

    print while defined( $_ = $files->dequeue );
    print 'Done';
      Thank you very much for your vivid elaboration which is very inspiring. =D
      Why are you doing that?
      Because when an object falls into a thread's scope, the object will be cloned, which is not the one I want. And since threads don't share objects / complex data structures (and I don't want to share them one by one), this trick does share the object perfectly... until it doesn't.

      Actually, I can do the job simply with @dirToFetch : shared, but it has the same issue as Thread::Queue: I have to leave it at a nested package scope, and creating it inside an object becomes another mess to share around threads. Since I'm trying to make this a module, I want to avoid the data getting messed up if some other script calls this module from inside threads.
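
      For reference, threads::shared can share a nested structure directly via shared_clone, which deep-copies the structure into shared memory so every thread sees the same arrays. A minimal sketch (not the module above; the path and entry are made up):

      use strict;
      use warnings;
      use threads;
      use threads::shared;

      # shared_clone() builds a deeply shared copy of the nested structure,
      # so threads operate on the same underlying arrays.
      my $state = shared_clone( { dirToFetch => ['M:/'], isDir => [], isFile => [] } );

      my $t = threads->create( sub {
          lock $state;                                  # locks the underlying shared hash
          push @{ $state->{isDir} }, 'M:/some-subdir';  # made-up entry
      } );
      $t->join;

      print scalar @{ $state->{isDir} }, "\n";          # prints 1 -- the thread's push is visible here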

      Though, I've updated my OP's code, which works and is as fast as dir /s/b. I create as many threads as there are entries in @dirToFetch.

        Why are you doing that?
        Because when an object falls into a thread's scope, the object will be cloned, which is not the one I want. And since threads don't share objects / complex data structures (and I don't want to share them one by one), this trick does share the object perfectly... until it doesn't.

        Yes - due to Perl's reference counting, accessing variables in another thread's memory always means that your thread will also be writing at least to the refcount field of that variable. If the refcount happens to reach zero in a different thread than the one where the piece of memory was originally allocated, the memory will be freed in the wrong thread context, which is not fun.

        I'm not aware of a way to make Perl skip its refcounting for variables, and I'm also not convinced that this could work except in the most trivial cases.

        Another idea to reduce the conceptual load of the approach might be to simply shell out to cmd /c "dir /b /s $directory", but then you need to be aware of the codepage that cmd.exe uses for its output. Ideally you have set the codepage to Unicode / 65001:

        chcp 65001

        ... but then, you still have to live with the fun of Perl and the OS treating the octets for filenames differently unless you properly decode and encode them.
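
        For what it's worth, a rough sketch of that approach (untested here; the drive letter is just an example): switch the codepage, capture the dir output, and decode the octets with Encode before treating them as filenames:

        use strict;
        use warnings;
        use Encode qw( decode );

        my $root = 'M:\\';
        # Backticks go through cmd.exe on Windows, so chcp and dir can be chained.
        my @raw = `chcp 65001 >NUL & dir /b /s "$root"`;
        die "dir failed: $?" if $?;

        for my $line ( @raw ) {
            chomp $line;
            my $name = decode( 'UTF-8', $line );   # octets -> Perl characters
            # ... work with $name here ...
        }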

Re: use threads for dir tree walking really hurts (MFT?)
by Discipulus (Canon) on Sep 01, 2016 at 08:26 UTC
    Very interesting topic, and ++Corion for the answer. Some sparse suggestions:

    For the little I know of the matter, I suppose that besides the thread implementation you put on the field, you must rely on the speed of the filesystem and of all the underlying OS-specific API calls.

    I doubt that more CPUs can fetch from a physical hard drive faster than a single one.

    My experience fighting with Windows is very long, long enough to let me say that it is by far better and faster to use, as much as possible, the native tools offered (or, well, concealed) by the OS. An example is fetching file permissions: old Perl modules existed, but wrapping around tools like icacls.exe is faster, less error prone, and has worked for decades.
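
    As a rough illustration of that kind of wrapping (just the idea, not a tested module), one can shell out to icacls.exe and keep its ACE lines:

    use strict;
    use warnings;

    my $path = 'C:\\Windows\\notepad.exe';    # example file
    my @out  = `icacls "$path"`;
    die "icacls failed: $?" if $?;

    # icacls prints the path plus one ACE per line, then a
    # "Successfully processed ..." summary line which we drop.
    my @aces = grep { /\S/ && !/^Successfully processed/ } @out;
    print @aces;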

    So, going for native solutions, I think the best would be to read the Master File Table directly: this intrigues me a lot, but I suspect it is a task far beyond my hackery skills.

    Reading the MFT is easier than writing to it, and some tools can do it, so it is feasible. Look at ultrasearch and at this discussion that points to swiftsearch on sourceforge.

    Also, the Linux NTFS file system driver can be a rich source of information, if you are able to investigate a Linux driver.

    The task of reading the MFT can be accomplished in other languages: see a very detailed answer on stackoverflow about C# and this python example.

    See also analyzeMFT, ntfswalk, and MFT_parser.

    For the thread part, you may be interested in marioroy's MCE, which comes with many useful examples. At the monastery, marioroy showed an example of dir walking using MCE in the thread Re: Perl threads - parallel run not working

    Update: ATTENTION: play your MFT tests on a test hard disk, because the risk of corruption is always present!

    Good luck and share your improvements!

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

      Just as a side note, multiple threads (up to a limit) definitely can access the filesystem faster than a single one. On unix, for example, I regularly do rm -rf by forking off (in perl) two sub-processes per directory for recursion (so you can get a lot of processes going at once). I doubt Windows is significantly different in this aspect: when the filesystem loads a given directory, it likely loads the full inode/sector and then hands back one item at a time, while another thread could request a different inode/sector and start acting on that. Any changes would be made in memory and synced out when the OS felt it either had to or had time to, so this can be much faster overall.
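
      Not my actual script, but a minimal unix-flavoured sketch of that fork-per-directory recursion pattern, applied here to printing rather than deleting:

      use strict;
      use warnings;

      sub fork_walk {
          my ($dir) = @_;
          opendir my $dh, $dir or return;
          my @kids;
          for my $e ( grep { $_ ne '.' && $_ ne '..' } readdir $dh ) {
              my $path = "$dir/$e";
              if ( -d $path && !-l $path ) {
                  if ( my $pid = fork ) { push @kids, $pid }            # parent keeps scanning
                  elsif ( defined $pid ) { fork_walk($path); exit 0 }   # child recurses in parallel
              }
              else {
                  print "$path\n";
              }
          }
          closedir $dh;
          waitpid $_, 0 for @kids;    # reap this directory's children
      }

      fork_walk( shift @ARGV || '.' );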

      Thanks for your suggestion, but indeed I don't dare to touch the MFT (which is far beyond my ability and resources). Ha!

      I have added a rewrite to my OP which can run as fast as dir does, and I think that's good enough for me, but please help point out if there are any potential problems there. Thanks a lot!

Re: use threads for dir tree walking really hurts
by marioroy (Prior) on Sep 05, 2016 at 21:50 UTC

    Update: Improved performance on September 6, 2016.

    Some time ago, Yary emailed me a fast walker utilizing the MCE module. Here it is with some modifications.

    use strict;
    use warnings;

    # usage: ./count_files.pl /var/tmp [ /path/to/dir2 ... ]
    # usage: perl count_files.pl C:\

    use Data::Dumper;
    use File::Spec::Functions qw( catfile );

    use MCE;
    use MCE::Queue;
    use MCE::Shared;

    # Get directories to start with
    my @start_dirs = @ARGV
        or die "Please specify one or more starting directories, stopped";

    for ( @start_dirs ) {
        -d or die "No such directory: $_, stopped";
    }

    # Shared queue, counter, and arrays
    my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs );
    my $free_count  = MCE::Shared->scalar( 0 );

    my $isDir  = MCE::Shared->array;
    my $isFile = MCE::Shared->array;
    my $denied = MCE::Shared->array;

    # Walk routine
    sub traverse {
        $free_count->decr;

        my ( $dir, $path );
        my @work = $_;

        while ( defined ( $dir = shift @work ) ) {
            opendir DH, $dir or $denied->push( $dir ), next;
            $isDir->push( $dir );

            my @files;

            for ( readdir DH ) {
                next if $_ eq '.' || $_ eq '..';
                if ( -d ( $path = catfile( $dir, $_ ) ) ) {
                    ( @work < 5 )
                        ? push @work, $path
                        : $shared_work->enqueue( $path );
                    next;
                }
                elsif ( -f _ ) {
                    push @files, $path;
                }
            }

            $isFile->push( @files ) if @files;
        }

        # Done with our work, let everyone know we're free
        if ( $free_count->incr == MCE->max_workers && !$shared_work->pending ) {
            $shared_work->enqueue( (undef) x MCE->max_workers );
        }
    }

    # Instantiate MCE and run.
    MCE->new(
        max_workers => 6,
        user_begin  => sub { $free_count->incr; },
        user_func   => sub {
            traverse() while ( defined ( $_ = $shared_work->dequeue ) );
        },
    )->run;

    # Display counts.
    printf "count dirs   : %d\n", $isDir->len;
    printf "count files  : %d\n", $isFile->len;
    printf "count denied : %d\n", $denied->len;

    # print Dumper( $isDir->export ),  "\n";
    # print Dumper( $isFile->export ), "\n";
    # print Dumper( $denied->export ), "\n";

    Best regards, Mario

      I never got around to the blog post I'd intended for the above code- some commentary:
      I doubt that more CPUs can fetch from a physical hard drive faster than a single one.
      When in doubt, benchmark! Both Mario & I timed a variety of ways to walk a directory tree, and well-written multi-threaded file walkers generally beat out the single-threaded ones. I'm not able to dig up my variants now, but if you're curious, compare what Mario posted, and the OP's working code, with File::Find on a few directory trees on the systems of your choice.
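
      A plain single-threaded File::Find baseline for such a comparison might look like this (a sketch; point it at whatever tree you want to time):

      use strict;
      use warnings;
      use File::Find;

      my $root = shift @ARGV || '.';
      my ( @dirs, @files );

      find(
          sub {
              # $File::Find::name holds the full path of the current entry
              if ( -d $_ ) { push @dirs,  $File::Find::name }
              else         { push @files, $File::Find::name }
          },
          $root,
      );

      printf "dirs: %d  files: %d\n", scalar @dirs, scalar @files;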

      My goal was to divide up the work with as little overhead as possible. And I wanted it general enough so that all threads would be busy, regardless of working in 1 directory with a million files, or a directory tree with a million leaves and no files.

      Above is the result of about eight different methods, with Mario's help, and there are still a couple more ideas I'd like to benchmark whenever I can get back to it; I think there are still more efficient ways!

      One thing benchmarking shows is that even though multi-threaded walkers can finish faster (they can use less wallclock time), CPU time always increases: it simply takes more total cycles when you add up the work from all CPUs. Be careful about what the Benchmark module is really telling you!
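
      For example, one quick way to see both numbers at once (a sketch; run_the_walker is a stand-in for whatever walker you are timing) is to pair Time::HiRes wallclock with the built-in times():

      use strict;
      use warnings;
      use Time::HiRes ();

      sub run_the_walker { }    # stand-in: put the walker under test here

      my $wall0 = Time::HiRes::time();
      my @cpu0  = times;        # (user, system, child user, child system) CPU seconds

      run_the_walker();

      my $wall = Time::HiRes::time() - $wall0;
      my @cpu  = times;
      my $cpu  = ( $cpu[0] - $cpu0[0] ) + ( $cpu[1] - $cpu0[1] );

      printf "wallclock: %.2fs   CPU: %.2fs\n", $wall, $cpu;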

      (Thanks Mario for making my code look better than the original & tweaking it too.)

        The following is an attempt to have all workers remain busy, "regardless of working in 1 directory with a million files, or a directory tree."

        For extra performance: execution completes ~25% faster via cperl-5.22c.

        use strict;
        use warnings;

        # usage: ./count_files.pl /var/tmp [ /path/to/dir2 ... ]
        # usage: perl count_files.pl C:\

        use Data::Dumper;
        use File::Spec::Functions qw( catfile );

        use MCE;
        use MCE::Queue;
        use MCE::Shared;

        # Get directories to start with
        my @start_dirs = @ARGV
            or die "Please specify one or more starting directories, stopped";

        for ( @start_dirs ) {
            -d or die "No such directory: $_, stopped";
        }

        # Shared queue, counter, and arrays
        my $shared_work = MCE::Queue->new( fast => 1, queue => \@start_dirs );
        my $free_count  = MCE::Shared->scalar( 0 );

        my $isDir  = MCE::Shared->array;
        my $isFile = MCE::Shared->array;
        my $denied = MCE::Shared->array;

        # Walk routine
        sub traverse {
            $free_count->decr;

            my ( $entry, $path );
            my @work  = ( $_ );
            my $count = 0;

            while ( defined ( $entry = shift @work ) ) {
                if ( ref $entry ) {
                    $isFile->push( @{ $entry } );
                }
                else {
                    opendir DH, $entry or $denied->push( $entry ), next;
                    $isDir->push( $entry );

                    my @files;

                    for ( readdir DH ) {
                        next if $_ eq '.' || $_ eq '..';
                        if ( -d ( $path = catfile( $entry, $_ ) ) ) {
                            ( @work < 5 )
                                ? push @work, $path
                                : $shared_work->enqueue( $path );
                            next;
                        }
                        elsif ( -f _ ) {
                            push @files, $path;
                            if ( ++$count == 2000 ) {
                                $shared_work->enqueue( \@files );
                                $count = 0, @files = ();
                            }
                        }
                    }

                    $isFile->push( @files ) if @files;
                }
            }

            # Done with our work, let everyone know we're free
            if ( $free_count->incr == MCE->max_workers && !$shared_work->pending ) {
                $shared_work->enqueue( (undef) x MCE->max_workers );
            }
        }

        # Instantiate MCE and run.
        MCE->new(
            max_workers => 6,
            user_begin  => sub { $free_count->incr; },
            user_func   => sub {
                traverse() while ( defined ( $_ = $shared_work->dequeue ) );
            },
        )->run;

        # Display counts.
        printf "count dirs   : %d\n", $isDir->len;
        printf "count files  : %d\n", $isFile->len;
        printf "count denied : %d\n", $denied->len;

        # print Dumper( $isDir->export ),  "\n";
        # print Dumper( $isFile->export ), "\n";
        # print Dumper( $denied->export ), "\n";

        Best regards, Mario.

Re: use threads for dir tree walking really hurts
by Marshall (Canon) on Sep 01, 2016 at 05:25 UTC
    I am curious to see the single-thread code that "wasn't fast enough". I suppose it's possible that an improvement there could make a big difference? I don't know. Just curious.