in reply to Re: File::Find won't iterate through CIFS share without "dont_use_nlink"
in thread File::Find won't iterate through CIFS share without "dont_use_nlink"

Thanks for the fast reply. Nothing to be done about nlink, then. I got it.

I know this is a bit off topic, but do you think File::Find would be the best option for iterating folder trees where the structure is always the same (1st level = year; 2nd level = month; 3rd level = day; 4th level = hour)? E.g. "/mnt/server01/2017/01/30/07/audiofile01.wav".

Only the last level contains files, which are always audio files. As you can see, the structure is constant (even though folders are created daily/hourly) and the content is quite predictable. Should I be using File::Find for this? Or would a simple iteration (taking into account loops caused by links pointing to upper dirs, and everything else) be enough? Also, the inability to run File::Find searches in threads is a big drawback for me.
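
Just to illustrate what I mean by a simple iteration, something like this sketch (the base path is taken from the example above, and the print is a stand-in for the real processing):

use strict;
use warnings;

# Walk the fixed year/month/day/hour layout with glob() instead of File::Find.
my $base = '/mnt/server01';

for my $hourdir ( glob("$base/*/*/*/*") ) {
    next unless -d $hourdir;                  # skip stray files at upper levels
    for my $file ( glob("$hourdir/*.wav") ) {
        print "$file\n";                      # stand-in for the real processing
    }
}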

Re^3: File::Find won't iterate through CIFS share without "dont_use_nlink"
by Corion (Patriarch) on Jan 30, 2017 at 13:08 UTC

    Depending on what your actual needs are, your situation could lend itself very neatly to a parallel approach, provided that reading the files actually is your bottleneck.

    Since you know the fixed structure of the directories, you can simply spawn 24 copies of the processing program, each reading its respective hourly directory and processing those files. This should be faster than iterating the whole directory tree and processing items as they are found.
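
    A rough sketch of that approach, using fork (the day directory and the processing call are placeholders):

    use strict;
    use warnings;

    # One child process per hourly directory of a single day (sketch only).
    my $day = '/mnt/server01/2017/01/30';       # placeholder day directory

    my @pids;
    for my $hour ( map { sprintf '%02d', $_ } 0 .. 23 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {                      # child: handle one hour
            for my $file ( glob("$day/$hour/*.wav") ) {
                # process($file);               # your conversion step goes here
            }
            exit 0;
        }
        push @pids, $pid;
    }
    waitpid( $_, 0 ) for @pids;                 # wait for all 24 children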

    Also potentially fast might be your OS's equivalent of launching open my $entries, "ls -1R $basedir |" and reading from <$entries>, as that moves the directory scanning into a separate program.
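
    In Perl, a sketch of that could look like this (it reassembles full paths from the directory headers that ls -R prints):

    use strict;
    use warnings;

    my $basedir = '/mnt/server01';

    # Let "ls -1R" do the directory walking in a separate process.
    open my $entries, '-|', 'ls', '-1R', $basedir
        or die "Cannot run ls on $basedir: $!";

    my $dir = $basedir;
    while ( my $line = <$entries> ) {
        chomp $line;
        if ( $line =~ /^(.+):$/ ) {             # "ls -R" prints a "dir:" header line
            $dir = $1;
            next;
        }
        next unless $line =~ /\.wav$/;
        print "$dir/$line\n";                   # full path of one audio file
    }
    close $entries;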

    You could also consider having a process on the remote machine scan the directories and add new entries to a database and then accessing the database from your other machine for processing, but that risks files getting "forgotten" if they never enter the database or get moved to a different location in a manual process.

    Also consider whether you really want to iterate over your complete tree of incoming data folders - judicious use of $File::Find::prune can let File::Find skip a whole year if you can be certain that there can be no new items to process there.
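
    For example, something like this sketch, assuming the four-digit year directories sit directly below the mount (the cutoff year is made up):

    use strict;
    use warnings;
    use File::Find;

    $File::Find::dont_use_nlink = 1;            # needed on the CIFS share, as above

    my $cutoff_year = 2016;                     # assume everything up to here is done

    find(
        sub {
            # Year directories sit directly below the mount and are four digits.
            if ( -d $_ && /^\d{4}$/ && $_ <= $cutoff_year ) {
                $File::Find::prune = 1;         # do not descend into this year
                return;
            }
            print "$File::Find::name\n" if /\.wav$/;
        },
        '/mnt/server01',
    );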

      Reading is definitely my bottleneck. The disk I read from is working close to its limit, yet the server running the script still has idle CPU time because of the remote disk's low throughput. Therefore I need to read from multiple servers (disks) at once.

      I believe I do need to iterate, since I don't know which folders might be missing (deleted for being empty). Remote processes aren't an option, as I have a couple hundred (Windows) servers to search.

      The flow is as follows:

      1) Audio files are created on the remote servers by some applications.

      2) A boss thread searches for filenames on the remote servers using File::Find and enqueues them (steps 2 and 3 are sketched after this list).

      3) Worker threads dequeue items and call a conversion application, which converts the file and sends it to storage (it is never written to the local server's HDD).

      4) Upon success, the audio file is deleted from the remote server.
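
      Stripped down, steps 2 and 3 look roughly like this (just a sketch; the actual conversion call is left out):

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;
      use File::Find;

      my $queue     = Thread::Queue->new;
      my $n_workers = 4;                          # placeholder worker count

      # Worker threads: dequeue a filename, convert it, delete it on success.
      my @workers = map {
          threads->create( sub {
              while ( defined( my $file = $queue->dequeue ) ) {
                  # system( 'converter', $file ); # conversion + upload goes here
                  # unlink $file if $? == 0;      # delete from the remote server
              }
          } );
      } 1 .. $n_workers;

      # Boss: walk the share(s) and enqueue every audio file found.
      find( sub { $queue->enqueue($File::Find::name) if /\.wav$/ },
            '/mnt/server01' );

      $queue->end;                                # let blocked dequeue calls return undef
      $_->join for @workers;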

      Right now I need a faster way to search for those filenames. I'm trying to grab a few filenames from each remote server, cycling through the servers over and over, but I'm not able to feed the queue as fast as it is consumed. So I need to either make the search (and the cycling) much faster, or make the search truly multi-threaded.

        If your bottleneck is the I/O performance of each server, I would launch a process per server to scan the directories, and potentially (I/O performance permitting) more processes to process each found file from a queue.
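
        As a sketch of the per-server scanning processes (server names and mount points are made up):

        use strict;
        use warnings;
        use File::Find;

        my @servers = qw( server01 server02 server03 );   # made-up names

        # One scanning process per mounted server; each writes its own file list.
        my @pids;
        for my $server (@servers) {
            my $pid = fork();
            die "fork failed: $!" unless defined $pid;
            if ( $pid == 0 ) {
                open my $out, '>', "/tmp/$server-files.txt"
                    or die "Cannot write list for $server: $!";
                find( sub { print {$out} "$File::Find::name\n" if /\.wav$/ },
                      "/mnt/$server" );
                close $out;
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid( $_, 0 ) for @pids;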

        You already seem to have the enqueueing and parallel processing of the found files covered, but short of eliminating the network bottleneck by launching processes on the remote servers, or by launching local processes for each server, I don't see how your approach could be made faster.

        You could periodically launch the equivalent of

        dir /b /s c:\audiofiles\ > \\centralserver\incoming\audio-%hostname%.log

        ... or use Perl to do the same. That would largely eliminate the scanning traffic over the network. But if your scanning is bound by the ability of the (Windows server) hard disks to deliver the directory entries, that won't help much either.
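
        A Perl equivalent, meant to run on each Windows server (sketch only; the UNC target is the hypothetical one from the dir example above):

        use strict;
        use warnings;
        use File::Find;
        use Sys::Hostname;

        my $host = hostname();
        my $list = "\\\\centralserver\\incoming\\audio-$host.log";   # hypothetical share

        open my $out, '>', $list or die "Cannot write $list: $!";
        find(
            sub { print {$out} "$File::Find::name\n" if /\.wav$/i },
            'C:/audiofiles',
        );
        close $out;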

        Before you embark on a longer programming spree, you could do a test in the shell to see whether the parallel performance is actually faster than the serial scanning:

        TS=$(date '+%Y%m%d-%H%M%S')
        for server in server01 server02 server03; do
            ls -1R /mnt/$server/incoming > /tmp/$server-$TS.log &
        done
        wait    # wait for all three scans to finish

        This will try to scan three servers in parallel.