You should probably test on a smaller data set, then? Anyway, I'm getting different results, with my original code being roughly 55% faster on my single-user machine (as expected).
I also added a native Perl implementation that walks the tree itself with no overhead; that gains you another significant speed boost.
D:\ENV>perl pm10.pl
Holli (New). Found: 1 ( D:\env\Videos/2012 )
Time: -19
Holli (original). Found: 1 ( d:\env/Videos/2012 )
Time: -32
ovedpo15. Found: 1 ( d:/env/Videos/2012 )
Time: -51
Using this code.
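(The attached script isn't reproduced here; the following is only a rough sketch of the native-walk approach, assuming the task from the thread is to collect every directory that contains the target file but no .ignore file.)

use strict;
use warnings;

# Sketch only -- not the attached script. Assumes the task from the thread:
# collect every directory that contains the target file but no .ignore file.
my $target = '2012.avi';
my @found;

sub walk {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or return;
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;

    # A .ignore file prunes this directory and everything below it.
    return if grep { $_ eq '.ignore' } @entries;

    push @found, $dir if grep { $_ eq $target } @entries;

    for my $entry (@entries) {
        my $path = "$dir/$entry";
        walk($path) if -d $path and not -l $path;    # recurse, skip symlinks
    }
}

walk('D:/env');    # starting point used in the benchmark output above
print 'Found: ', scalar @found, " ( @found )\n";

The win over File::Find comes from skipping the per-file callback and stat overhead; the walker only looks at the entry names it actually needs.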
holli
You can lead your users to water, but alas, you cannot drown them.
Thank you for the good answer. It does reduce the time, but not by much (roughly 10 minutes out of 4 hours), so I'm still hunting for more ideas.
In the following link: https://stackoverflow.com/questions/2681360/whats-the-fastest-way-to-get-directory-and-subdirs-size-on-unix-using-perl
someone suggested:
I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.
Is it possible to show what he means? I thought maybe to implement a smart subroutine that can find big directories containing subdirectories, use that idea to catch all the valid dirs, and then merge them into one array. Thank you again.
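For reference, a rough sketch of the fork()/temp-file/merge pattern that the quoted answer describes might look like the code below. Here scan_dirs() is a hypothetical placeholder for the real per-directory search, and the group count is something to tune empirically, as the answer says.

use strict;
use warnings;
use File::Temp qw(tempfile);

# Rough sketch of the fork()/temp-file/merge pattern from the quoted answer.
# scan_dirs() is a hypothetical placeholder for the real per-directory work
# (e.g. looking for 2012.avi while honouring .ignore files).
my @top_dirs = glob('D:/env/*');    # the ~20 top-level directories (assumed)
my $workers  = 4;                   # number of groups/children; tune empirically

# Split the top-level directories into round-robin groups, one per child.
my @groups;
push @{ $groups[ $_ % $workers ] }, $top_dirs[$_] for 0 .. $#top_dirs;

my %tmpfile_for;                    # child pid => temp file holding its results
for my $group (@groups) {
    next unless $group && @$group;
    my ( $fh, $tmpname ) = tempfile( UNLINK => 0 );
    close $fh;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ( $pid == 0 ) {              # child: scan its group, dump results, exit
        open my $out, '>', $tmpname or die "open $tmpname: $!";
        print {$out} "$_\n" for scan_dirs(@$group);
        close $out;
        exit 0;
    }
    $tmpfile_for{$pid} = $tmpname;  # parent: remember where to collect results
}

# Parent: wait for every child, then merge the per-child result files.
my @all_results;
while ( ( my $pid = wait() ) != -1 ) {
    my $tmpname = delete $tmpfile_for{$pid} or next;
    open my $in, '<', $tmpname or die "open $tmpname: $!";
    chomp( my @lines = <$in> );
    push @all_results, @lines;
    close $in;
    unlink $tmpname;
}
print "$_\n" for @all_results;

sub scan_dirs { return @_ }         # placeholder: replace with the real search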
I am not sure about this idea, but it is an idea to try.
File::Find calls the "wanted" sub for each "file" that it finds.
A directory is actually a special kind of a file.
When File::Find enters a directory, there is a preprocess sub that can be called, for example, to sort the order in which the files in that directory will be fed to the wanted() sub.
Perhaps using this preprocess sub may make things faster? I don't know; I've never had to worry about performance at this level.
All of this File::Find stuff works on the volume's directory structure, and all of that info will quickly become memory resident. The size of the disk and how much data is on it doesn't matter.
For your application, the number of directories matters. If you know all of the directories, the file system can determine quickly whether the .ignore or the target file '2012.avi' exists in a given directory. That sort of query could potentially be multi-threaded.
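A minimal sketch of that per-directory check, assuming a hypothetical @dirs array that already holds every directory path of interest (say, cached from an earlier run):

use strict;
use warnings;

# @dirs is assumed to already hold every directory path of interest;
# the test itself is just two file-existence checks per directory.
my @dirs = ( 'D:/env/Videos/2012', 'D:/env/Music' );    # placeholder paths
my @hits = grep { !-e "$_/.ignore" && -e "$_/2012.avi" } @dirs;
print "$_\n" for @hits;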
There are ways in which your program can be informed by the O/S when a new directory is created. I suppose that if you know what the result was one hour ago, that might help with the calculation of the current result? The details of your app are a bit unclear to me.
Anyway, below is an idea to benchmark. I don't know what the result will be.
The code hasn't been run; it's just an idea.
use strict;
use warnings;
use File::Find;

my @found;
my $target  = '2012.avi';
my %options = (
    preprocess => \&preprocess_dir,
    wanted     => \&wanted,
);
find( \%options, "C:/test" );

sub preprocess_dir
{
    # @_ holds every name in the directory currently being processed
    # ($File::Find::dir). If a .ignore file is present, prune the whole
    # directory (and its subtree) by returning an empty list.
    return () if grep { $_ eq '.ignore' } @_;

    # .ignore wasn't found, so record the directory if the target is there.
    push @found, "$File::Find::dir/$target" if grep { $_ eq $target } @_;

    # Hand only the subdirectories back to File::Find so it still recurses;
    # plain files are dropped because wanted() has nothing to do with them.
    return grep { -d && $_ ne '.' && $_ ne '..' } @_;
}

sub wanted
{ return; }    # nothing to do here; preprocess_dir does all the work