in reply to Re^5: Finding files recursively
in thread Finding files recursively

Thank you for the good answer. It does reduce the time, but not by much (about 10 minutes out of 4 hours), so I'm still hunting for more ideas.
In the following link: https://stackoverflow.com/questions/2681360/whats-the-fastest-way-to-get-directory-and-subdirs-size-on-unix-using-perl
Someone suggested:

I once faced a similar problem, and used a parallelization approach to speed it up. Since you have ~20 top-tier directories, this might be a pretty straightforward approach for you to try. Split your top-tier directories into several groups (how many groups is best is an empirical question), call fork() a few times and analyze directory sizes in the child processes. At the end of the child processes, write out your results to some temporary files. When all the children are done, read the results out of the files and process them.

Is it possible to show what he means? I thought maybe to implement a smart subroutine that finds big directories containing subdirectories, use that idea to catch all the valid dirs, and then merge them into one array. A rough sketch of how I read the suggestion follows. Thank you again.
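
This is untested and only meant to illustrate the fork/temp-file idea from the quote; the /data/top path, the group count of 4 and the /tmp file names are made up for illustration:

use strict;
use warnings;
use File::Find;

# Untested sketch of the quoted idea: split the ~20 top-tier directories
# into groups, scan each group in a forked child, let each child write its
# hits to its own temporary file, then merge. Paths and counts are made up.
my @top_dirs = glob("/data/top/*");     # the ~20 top-tier directories
my $groups   = 4;                       # how many children to fork (tune this)

# round-robin the directories into $groups buckets
my @bucket;
push @{ $bucket[ $_ % $groups ] }, $top_dirs[$_] for 0 .. $#top_dirs;

my @tmp_files;
for my $i ( 0 .. $groups - 1 ) {
    next unless $bucket[$i] && @{ $bucket[$i] };
    my $tmp = "/tmp/found.$i.$$";
    push @tmp_files, $tmp;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;                       # parent: go fork the next child

    # child: search its share of the directories and record every hit
    open my $fh, '>', $tmp or die "open $tmp: $!";
    find( sub { print {$fh} "$File::Find::name\n" if $_ eq '2012.avi' },
          @{ $bucket[$i] } );
    close $fh;
    exit 0;
}
1 while wait() != -1;                   # wait for all children to finish

# merge the per-child result files into one array
my @found;
for my $tmp (@tmp_files) {
    open my $fh, '<', $tmp or next;
    chomp( my @hits = <$fh> );
    push @found, @hits;
    close $fh;
    unlink $tmp;
}
print scalar(@found), " file(s) found\n";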

Re^7: Finding files recursively
by holli (Abbot) on Aug 05, 2019 at 21:38 UTC
    I would expect more than a 4% speedup. You mentioned other users. Are you running this on some kind of shared network drive? If so, then THAT is your bottleneck. It's hard to say whether parallelization will speed things up without knowing more about the directory structure.


    holli

    You can lead your users to water, but alas, you cannot drown them.
      I tried a few tests; it always shows a 10-15 min difference. We use VNC, so other users also use the machine, but that should not add much of a penalty to the search. Isn't fork() a good idea when we have big directories?
        Isn't fork() a good idea when we have big directories?

        Your bottleneck is - assuming you are running the code directly on the machine - the disk system. There is an upper limit to the number of bytes/sec that you can read through the disk interface (in the case of SATA-III, about 6 GBit/s, or roughly 600 MByte/s; see Serial_ATA). Your disk is usually much slower. Plus, there are seek times: the disk has to literally search for the directory on the disk. An arm carrying the read heads has to be moved over the surface of the platter, and that takes some time, typically a few milliseconds per read access. Assuming a fast disk at roughly 1 ms per seek, that works out to about 1 sec per 1000 directories, maybe less, maybe more, spent just waiting for the seek time.

        Normally, the operating system (and the disk) caches some parts of the disk. But if you traverse the directory of the entire disk, or large parts of it, you will read more data than any cache will hold. Especially when traversing for the first time, your caches are "cold", i.e. have not yet read the data from disk. If you have insanely large amounts of RAM, your OS may have cached and read ahead a little bit during the scan. But generally, it did not.

        SSDs avoid the seek time, because nothing has to be moved, but you still have to read the data. NVMe SSDs can be accessed at PCIe speeds of about 4 GByte/s (PCIe 3.0, 4 lanes). That's about 10 times faster than SATA-III, but SSDs rarely deliver that speed, and even fewer can do so continuously without overheating.

        Now, what happens when you distribute the load over, say a thousand processes forked from the main process?

        Right, each process gets 1/1000 of the available bandwidth. So instead of reading 600 MByte/s from SATA-III into one process, you are reading 1000 x 0.6 MByte/s into 1000 processes. Well, you are not. Switching between 1000 processes has significant overhead, you are forcing the disk to seek even more, you are wasting RAM on processes that won't help you instead of using it for caching, and as explained before, your SATA-III disk won't be able to deliver 600 MByte/s in the first place. So things become significantly WORSE. Feel free to replace 1000 with any other integer > 1.

        Now, networking: running your code on a computer not directly connected to the disks. Gigabit ethernet has a theoretical limit of 1 GBit/s, i.e. about 125 MByte/s, easily saturated by a single SATA-III interface. NVMe won't help you at all. The practical limit is lower, at about 50 to 75%, especially if you use more than two computers in the same network. Switching to the more expensive 10 GBit/s ethernet raises the ceiling to about 1.25 GByte/s, which in practice is barely enough for a single SATA-III interface. Throw in NVMe or a second SATA-III interface and you are again saturating the network interface. Forking new processes won't help you: the network interface is saturated, and you cannot push more data through it.

        Other people working on the same machine. Guess what happens. They also need the disk. They take away bandwidth and cause more seek times. Plus, they also need the CPU, slowing down your process(es). Again, forking won't help you.

        VNC. I like VNC, but it either needs a lot of network bandwidth to transport bitmap images of the remote screen, or it needs a lot of CPU time and memory to compress the bitmap images. If your code does not run on the machine connected to the disks, VNC steals bandwidth, memory, and CPU, even if only other people use VNC. Forking won't help you here.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        Are you processing the files you found in any way you don't show us? I'm still looking for an explanation for the measly speedup you're experiencing. Time for a reality check. How long does it take to run this?
        find /where/the/secret/files/are -name secret.file 1>secret-files.dat 2>/dev/null
        Also, how stressed is the server? Please try to find out about the CPU-load and the IO-load.
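        For example, on a typical Linux box (iostat is part of the sysstat package; watch CPU idle/iowait and the per-device utilisation):
        uptime            # load averages
        top               # per-process CPU and memory
        iostat -x 5 3     # per-device I/O stats: 3 samples, 5 seconds apart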


        holli

        You can lead your users to water, but alas, you cannot drown them.

        Fork one process for each physically different disk you have. Not partitions, not directories.
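
        An untested first approximation, in case you are not sure how many that is: group the top-level directories by the device number that stat() reports. Note that this only distinguishes filesystems (partitions), so you still need to know which partitions share one physical disk; the /data/top path is made up.

        use strict;
        use warnings;

        my @top_dirs = glob("/data/top/*");     # made-up path to the top-level directories

        my %dirs_by_dev;
        for my $dir (@top_dirs) {
            my ($dev) = stat($dir);             # first field of stat() is the device number
            push @{ $dirs_by_dev{$dev} }, $dir if defined $dev;
        }

        my $devices = keys %dirs_by_dev;        # keys in scalar context = count
        print "$devices device(s) -> fork at most $devices worker(s)\n";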

Re^7: Finding files recursively
by Marshall (Canon) on Aug 09, 2019 at 08:23 UTC
    I am not sure about this idea, but it is an idea to try.
    File::Find calls the "wanted" sub for each "file" that it finds.
    A directory is actually a special kind of a file.

    When File::Find enters a directory, there is a pre-process sub that can be called for example to sort the order in which the files in that directory will be fed to the wanted() sub.

    Perhaps using this preprocess sub could make things faster? I don't know; I've never had to worry about performance at this level.

    All of this File::Find stuff works on the volume's directory structure. That information will quickly become memory resident. The size of the disk and how much data is on it doesn't matter.

    For your application, the number of directories matters. If you know all of the directories, the file system can determine quickly if the .ignore or the target file '2012.avi' exists in that directory or not. That sort of query could potentially be multi-threaded.
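
    For example, if the list of directories were already known (building that list is the separate problem), the per-directory test is just two existence checks. Taking the directories from the command line here is only for illustration:

    use strict;
    use warnings;

    my @all_dirs = @ARGV;       # e.g. pass the known directories on the command line
    my @found;
    for my $dir (@all_dirs) {
        next if -e "$dir/.ignore";                        # directory is marked to be skipped
        push @found, "$dir/2012.avi" if -e "$dir/2012.avi";
    }
    print "$_\n" for @found;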

    There are ways in which your program can be informed by the O/S when a new directory is created. I suppose that if you know what the result was one hour ago, that might help with the calculation of the current result? The details of your app are a bit unclear to me.

    Anyway, below is an idea to benchmark. I don't know what the result will be.
    Code hasn't been run.. just an idea..

    use strict;
    use warnings;
    use File::Find;

    my @found;
    my $target  = '2012.avi';
    my %options = ( preprocess => \&preprocess_dir,
                    wanted     => \&wanted );

    find( \%options, "C:/test" );

    sub preprocess_dir
    {
        my @avi_path = ();
        foreach my $this_name (@_)
        {
            # a .ignore file disqualifies this directory (and its subtree)
            return () if $this_name =~ /\.ignore$/;

            # defer judgement until the whole directory has been scanned
            push @avi_path, "$File::Find::dir/$target" if $this_name eq $target;
        }
        # no .ignore was found, so keep what we collected
        push @found, @avi_path;

        # return only the subdirectory names so find() keeps descending;
        # there is nothing for the wanted sub to do with them
        return grep { -d } @_;
    }

    sub wanted { return (); }   # nothing to do here, only directory names
                                # ever reach this sub