ovedpo15 has asked for the wisdom of the Perl Monks concerning the following question:

I would like to find all directories which contain a specific file. Until now, we used the following idea:
use File::Find;
use Cwd qw(abs_path);

my @list_of_dirs;
find( sub { get_dirs( \@list_of_dirs, $_ ) }, $root_path );

sub get_dirs {
    my ($dirs_aref, $current_path) = @_;
    my $abs_path    = abs_path($current_path);
    my $file        = $abs_path . "/" . "secret.file";
    my $ignore_file = $abs_path . "/" . ".ignore";
    push @{$dirs_aref}, $abs_path if (-e $file) && !(-e $ignore_file);
}
The problem is that searching a large directory tree can take hours. I'm trying to reduce the waiting time.
My first idea was to split the directories across several processes so they can search in parallel, but I'm not sure that is a good idea.
Can you suggest a better approach?

Replies are listed 'Best First'.
Re: Finding files recursively
by holli (Abbot) on Aug 04, 2019 at 18:09 UTC
    You are fighting the module, and you are doing a lot of unnecessary work. Consider:
    use File::Find;

    my @found;
    my $path   = 'd:\env\videos';
    my $target = '2012.avi';

    find( sub {
        # We're only interested in directories
        return unless -d $_;
        # Bail if there is an .ignore here
        return if -e "$_/.ignore";
        # Add to the results if the target is found here
        push @found, $File::Find::name if -e "$_/$target";
    }, $path );

    print "@found";
    D:\ENV>perl pm10.pl
    d:\env\videos/2012
    D:\ENV>echo.>d:\env\videos\2012\.ignore
    D:\ENV>perl pm10.pl
    D:\ENV>


    holli

    You can lead your users to water, but alas, you cannot drown them.
      Thanks for your suggestion, but I don't understand the difference between the two approaches. Also, what is $target? Thank you again.
        $target is just the filename you are looking for, "secret.file" in your case.
        The difference is that my code exits the wanted function immediately when it is not dealing with a directory. Only if it is a directory does it check whether the target file is in that directory.

        Whereas your code looks at each and every file and calculates its base path (unnecessarily, since that information is already there in $File::Find::name), and then uses that base directory to look for the target file.
        This, and this is the biggest slowdown, also means that you test the same directory once for every entry it contains.


        holli

        You can lead your users to water, but alas, you cannot drown them.
Re: Finding files recursively
by dsheroh (Monsignor) on Aug 05, 2019 at 08:04 UTC
    The problem is that searching a large directory tree can take hours.
    How large are we talking? Does it take hours to run ls -RU over that directory? If so, then there's nothing you can do in Perl to do it faster because that's how long it takes for the disk to retrieve the directory entries. A quick test on my laptop suggests that 1 hour may correspond to about a million directory entries on this machine, but your hardware may vary. Wildly.

    Also, if you're on a *nix box, I'd be willing to bet that the OS's find binary is pretty well optimized. Generating a list of candidate directories with find $STARTING_DIR -name secret.file, then using Perl to run down that list and remove any with a .ignore file would probably be a pretty effective way to do this, albeit less effective as an exercise in using/learning more Perl, if that's your primary objective. There may even be a way to get find to filter out the directories with .ignore files in the first pass, so that you don't have to go back a second time to look for them, but my find-fu isn't up to that task.
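    For concreteness, here is a minimal sketch of that two-pass approach in Perl; the starting directory is a placeholder and error handling is omitted:

    use strict;
    use warnings;
    use File::Basename qw(dirname);

    # First pass: let the OS's find binary locate every secret.file.
    my $start = '/some/start/dir';    # placeholder starting directory
    my @candidates = map { dirname($_) }
                     split /\n/, `find $start -name secret.file`;

    # Second pass: drop any directory that also contains a .ignore file.
    my @dirs = grep { !-e "$_/.ignore" } @candidates;

    print "$_\n" for @dirs;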

    Even if you're going to ultimately write a Perl solution regardless, generating a list of all the secret.files with find is going to be a good sanity check to estimate the absolute fastest possible time the task could be done in.

    My first idea was to split the directories across several processes so they can search in parallel, but I'm not sure that is a good idea.
    If your bottleneck is on disk I/O rather than on processing, then parallelization won't help (if it's already waiting on the disk, having more CPU cores waiting isn't going to make the disk any faster) and may make things significantly worse (by making the disk spend more time jumping from one directory to another, and less time actually reading the data you want).
      Thanks for the reply!
      By a parallel process, I meant using fork(). Consider a directory with multiple subdirectories: I would fork a child for each one, let it find all the valid directories within its subtree, and then merge the resulting arrays (along the lines of the sketch below).
      Is that a bad idea?
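      A minimal sketch of that fork-per-subdirectory idea, using the CPAN module Parallel::ForkManager (my choice of module, not something mentioned in the thread; the root path and the limit of four children are placeholders):

      use strict;
      use warnings;
      use File::Find;
      use Parallel::ForkManager;

      my $root    = '/some/root';                 # placeholder root directory
      my @subdirs = grep { -d } glob "$root/*";

      my @all_dirs;
      my $pm = Parallel::ForkManager->new(4);     # at most 4 children at once

      # Collect each child's result list as it exits.
      $pm->run_on_finish( sub {
          my ($pid, $exit, $ident, $signal, $core, $data) = @_;
          push @all_dirs, @$data if $data;
      } );

      for my $dir (@subdirs) {
          $pm->start and next;                    # parent: spawn and move on
          my @found;
          find( sub {
              return unless -d $_;
              return if -e "$_/.ignore";
              push @found, $File::Find::name if -e "$_/secret.file";
          }, $dir );
          $pm->finish(0, \@found);                # child: ship results back
      }
      $pm->wait_all_children;

      print "$_\n" for @all_dirs;

      As dsheroh notes above, whether this actually helps depends on whether the bottleneck is CPU or disk I/O.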
Re: Finding files recursively
by bliako (Abbot) on Aug 05, 2019 at 02:11 UTC
    Those who cannot remember the past are condemned to repeat it.

    so do cache if the OS does not do this for you already

    and while you wait for your cache to build, this is worth reading (I found)

    ... Before reaching the final line, however, he had already understood that he would never leave that room, for it was foreseen that the city of mirrors (or mirages) would be wiped out by the wind and exiled from the memory of men at the precise moment when Aureliano Babilonia would finish deciphering the parchments, and that everything written on them was unrepeatable since time immemorial and forever more, because races condemned to one hundred years of solitude did not have a second opportunity on earth.

    Bottom line: do cache, but do not cache too much, lest all be wiped out.
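
    For instance, a minimal sketch of caching the scan results; the cache file location, the one-day expiry, and the use of Storable are my assumptions:

    use strict;
    use warnings;
    use File::Find;
    use Storable qw(store retrieve);

    my $root  = '/some/root';                          # placeholder
    my $cache = "$ENV{HOME}/.secretfile_dirs.cache";   # assumed cache location

    my $dirs;
    if ( -e $cache && -M $cache < 1 ) {    # reuse the cache if under a day old
        $dirs = retrieve($cache);
    }
    else {
        my @found;
        find( sub {
            return unless -d $_;
            return if -e "$_/.ignore";
            push @found, $File::Find::name if -e "$_/secret.file";
        }, $root );
        $dirs = \@found;
        store( $dirs, $cache );
    }

    print "$_\n" for @$dirs;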

    bw, bliako

Re: Finding files recursively
by tybalt89 (Monsignor) on Aug 06, 2019 at 14:57 UTC

    If you are on a *nix box with locate, that might be faster.

      If you are on a *nix box with locate, that might be faster.

      But only if you search for files that existed the last time updatedb was run. locate simply queries the database generated by updatedb. Depending on your system, updatedb runs from cron, or it has to be run manually. locate cannot find files that did not yet exist when updatedb last ran.
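
      If the database is fresh, the remaining filter pass in Perl is cheap. A minimal sketch (secret.file and the .ignore convention come from the original question; the rest is my assumption):

      use strict;
      use warnings;
      use File::Basename qw(dirname basename);

      # locate does a substring match over the full paths in its database.
      chomp( my @hits = `locate secret.file` );

      my @dirs = grep { !-e "$_/.ignore" }          # honour the .ignore marker
                 map  { dirname($_) }
                 grep { basename($_) eq 'secret.file' } @hits;

      print "$_\n" for @dirs;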

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      Similarly, mdfind -name foo on OS X (with the advantage that OS X's filesystem metadata DB is updated all but continuously).

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.
