reinaldo.gomes has asked for the wisdom of the Perl Monks concerning the following question:

As the title says, my problem is that File::Find won't iterate through a CIFS share without "dont_use_nlink". Of course I can simply set it to 1, and it works. The problem is that the documentation states that it makes things slower, and performance is absolutely critical for me. It also states that this option should rarely be used. Any chance that I'm doing something wrong, or is this by design?

"This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi"

File::Find version is 1.34 (latest)

OS is CentOS 7.3.1611 (Core)

The share was mounted using a command line like this one:

mount -t cifs -o username=<share user>,password=<share password>,domain=example.com //WIN_PC_IP/<share name> /mnt
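
For illustration, a minimal sketch of a find() call with the workaround enabled (the wanted callback and the paths are placeholders, not the original poster's actual code):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Workaround from the question: disable the nlink optimization so the
# CIFS share is traversed correctly (at the cost of extra stat calls).
$File::Find::dont_use_nlink = 1;

my $root = '/mnt';    # the CIFS mount point from the mount command above

find(
    sub {
        print "$File::Find::name\n" if -f;    # placeholder action
    },
    $root,
);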

Re: File::Find won't iterate through CIFS share without "dont_use_nlink"
by ikegami (Patriarch) on Jan 30, 2017 at 15:57 UTC

    The optimization used by dont_use_nlink => 0 relies on a feature not available on FAT and NTFS file systems, and you appear to be using such a file system.

    The optimization only helps if you have empty directories (i.e. directories with no files except . and ..). If you have few such directories, the optimization isn't helping anyway. If you have many such directories, maybe you could delete them in advance to gain the same benefit as the optimization.


    An explanation of the optimization:

    On unix file systems, a directory's . is a hardlink to itself, and a directory's .. is a hardlink to its parent directory. So when you stat a directory, the link count returned by stat will be at least 1 (for its name in the parent) + 1 (for its own .) + $num_sub_dirs (one .. per immediate subdirectory).

    $ ls -ld .
    drwx------ 5 ikegami ikegami 46 Dec 16 12:03 .    # 5: Up to 3 subdirs
    $ ls -l .
    total 0
    drwx------ 2 ikegami ikegami 10 Dec 16 12:03 a    # 2: Empty
    drwx------ 3 ikegami ikegami 24 Dec 16 12:03 b    # 3: Up to 1 subdir
    drwx------ 2 ikegami ikegami 10 Dec 16 12:03 c    # 2: Empty

    File::Find relies on that information to optimize itself when possible.

    Perl and File::Find know this isn't the case for the FAT and NTFS file systems, so the optimization is disabled on Windows.
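
    A small illustration of that arithmetic (a sketch, not File::Find's own code): stat's fourth field is the link count, so on a well-behaved Unix filesystem nlink minus 2 tells you how many immediate subdirectories to expect.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $dir = shift // '.';
    my @st  = stat $dir or die "stat $dir: $!";
    my $nlink = $st[3];    # field 3 of stat() is the link count

    # 1 for the directory's name, 1 for its own ".", one ".." per subdir.
    printf "%s: nlink=%d => expect %d subdirectories\n", $dir, $nlink, $nlink - 2;

    # On FAT/NTFS/CIFS the reported link count is not meaningful for
    # directories, which is why the optimization misbehaves there.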

      The optimization only helps if you have empty directories (i.e. directories with no files except . and ..)
      Erm, actually the optimisation is used to detect directories that contain no subdirectories (rather than directories that are merely empty). That way, for leaf-node directories, you don't have to stat() every directory entry looking for possible subdirectories.

      Dave.

        Oops, yeah.

        Ironically, the information that stat is used to fetch is provided by the native equivalent of readdir on Windows, so using the native call instead of readdir would gain the effect of the optimization, and not just for leaf directories!
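
        To make the shortcut concrete, here is a rough sketch of a scanner that uses the nlink count to avoid stat() calls in leaf directories (an illustration of the idea only, not File::Find's actual code):

        #!/usr/bin/perl
        use strict;
        use warnings;

        sub scan {
            my ($dir) = @_;
            my @st = stat $dir or die "stat $dir: $!";
            my $subdirs_left = $st[3] - 2;    # expected subdirectory count

            opendir my $dh, $dir or die "opendir $dir: $!";
            for my $entry (readdir $dh) {
                next if $entry eq '.' || $entry eq '..';
                my $path = "$dir/$entry";

                # Only stat entries while subdirectories are still expected;
                # once all of them are found (or nlink said there were none),
                # everything left is assumed to be a plain file, with no stat.
                if ($subdirs_left > 0 && -d $path) {
                    $subdirs_left--;
                    scan($path);
                }
                else {
                    print "$path\n";
                }
            }
            closedir $dh;
        }

        # On CIFS/FAT/NTFS the nlink value is not reliable, so this scanner
        # would wrongly treat subdirectories as files; hence dont_use_nlink.
        scan(shift // '.');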

Re: File::Find won't iterate through CIFS share without "dont_use_nlink"
by Corion (Patriarch) on Jan 30, 2017 at 12:42 UTC

    The situation is more that File::Find contains far too many assumptions about interacting with Unix filesystems that may have been true decades ago but certainly are not true even on Unix filesystems nowadays.

    IMO, a better approach nowadays would be to always set $dont_use_nlink to a true value and only enable the nlink optimization if the user can be certain that the nlink field is valid for the filesystem in question (which is quite hard to determine reliably).

    In your case, I would assume that while performance may be critical, correctness still beats performance. So I would benchmark the two settings and see whether the difference actually matters.
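
    For example, something along these lines would show whether the setting makes a measurable difference on the share in question (a sketch; the mount point and repetition count are placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(timethese);
    use File::Find;

    my $root = shift // '/mnt';
    my %seen;

    timethese(3, {
        with_nlink    => sub {
            local $File::Find::dont_use_nlink = 0;
            my $n = 0;
            find(sub { $n++ }, $root);
            $seen{with_nlink} = $n;
        },
        without_nlink => sub {
            local $File::Find::dont_use_nlink = 1;
            my $n = 0;
            find(sub { $n++ }, $root);
            $seen{without_nlink} = $n;
        },
    });

    # On the CIFS share the nlink-based run also misses entries, so compare
    # the entry counts as well as the timings.
    print "$_ saw $seen{$_} entries\n" for sort keys %seen;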

      Thanks for the fast reply. Nothing to be done about nlink, then. I got it.

      I know this is a bit off topic, but do you think File::Find would be the best option for me to iterate over folder trees where the structure is always the same (1st level = year; 2nd level = month; 3rd level = day; 4th level = hour)? e.g. "/mnt/server01/2017/01/30/07/audiofile01.wav"

      Only the last level contains files, which are always audio files. As you can see, the structure is constant (even though folders are created daily/hourly) and the content is quite predictable. Should I be using File::Find for this? Or would a simple iteration (taking into account concerns such as loops caused by links pointing to upper dirs, and everything else) be enough? Also, the inability to search with File::Find using threads is a big drawback to me.
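
      A fixed four-level layout like this can be walked without File::Find at all; for illustration, a minimal sketch using glob (the base path is taken from the example above, the processing step is a placeholder):

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $base = '/mnt/server01';    # from the example path above

      # year/month/day/hour: four fixed levels, so four wildcards reach the
      # hour directories; only they contain files, so no recursion or loop
      # detection is needed.
      for my $hour_dir (grep { -d } glob "$base/*/*/*/*") {
          for my $wav (glob "$hour_dir/*.wav") {
              print "$wav\n";        # placeholder for the real processing
          }
      }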

        Depending on what your actual needs are, your situation could lend itself very neatly to a parallel approach, provided that reading the files actually is your bottleneck.

        As you know the fixed structure of the tree, you can simply spawn 24 copies of the processing program, each of which reads and processes its respective hourly directory. This should be faster than iterating the whole directory tree and processing items as they are found.
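
        For illustration, a sketch of that idea using fork, one child per hour directory (the day being processed and the per-file work are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $day_dir = '/mnt/server01/2017/01/30';    # hypothetical day to process

        my @pids;
        for my $hour (map { sprintf '%02d', $_ } 0 .. 23) {
            my $dir = "$day_dir/$hour";
            next unless -d $dir;

            defined(my $pid = fork) or die "fork failed: $!";
            if ($pid == 0) {                         # child: handle one hour only
                for my $wav (glob "$dir/*.wav") {
                    # process($wav);                 # placeholder for the real work
                }
                exit 0;
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;                     # wait for all workers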

        Also potentially fast might be your OS equivalent of launching open my $entries, "ls -1R $basedir |" and reading from <$entries>, as that offloads producing the directory listing to a separate program.
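
        Spelled out, that pipe idea might look like this (a sketch; it assumes the usual "directory:" headers that ls -R prints, and uses the list form of open instead of interpolating $basedir into a shell command):

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $basedir = shift // '/mnt/server01';

        # List form of the pipe open avoids shell quoting issues with $basedir.
        open my $entries, '-|', 'ls', '-1R', $basedir or die "ls: $!";

        my $current_dir = $basedir;
        while (my $line = <$entries>) {
            chomp $line;
            if ($line =~ /^(.*):$/) {     # "ls -R" prints each directory as "path:"
                $current_dir = $1;
            }
            elsif (length $line) {
                print "$current_dir/$line\n";
            }
        }
        close $entries;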

        You could also consider having a process on the remote machine scan the directories and add new entries to a database and then accessing the database from your other machine for processing, but that risks files getting "forgotten" if they never enter the database or get moved to a different location in a manual process.

        Also consider whether you really want to iterate over your complete tree of incoming data folders - judicious use of $File::Find::prune can let File::Find skip a whole year if you can be certain that there can be no new items to process there.
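
        For example (a sketch; the cutoff logic and the file test are placeholders), pruning year directories older than the current year:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Find;

        $File::Find::dont_use_nlink = 1;    # needed on the CIFS share, as above

        my $base         = '/mnt/server01';
        my $current_year = (localtime)[5] + 1900;

        find(
            sub {
                # At the first (year) level, prune whole years that can no
                # longer receive new files: File::Find will not descend.
                if (-d && $File::Find::name =~ m{^\Q$base\E/(\d{4})$} && $1 < $current_year) {
                    $File::Find::prune = 1;
                    return;
                }
                print "$File::Find::name\n" if /\.wav$/i;
            },
            $base,
        );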