PerlMonks
smart glob of dated subfolders

by Anonymous Monk
on Feb 22, 2023 at 20:09 UTC [id://11150538]

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have directories A, B, etc. that contain dated subfolders in chronological order (YYYYMMDD format), and I am using glob to find ONLY the subfolders in the future. I cannot assume anything about the dates (i.e. how far into the past or future they reach). This function, however, has become quite slow over time, so I am looking for a way to speed it up. I call this function quite often, so I'd like to make it as fast as possible. I need to keep the history of dated folders, so deleting subfolders is NOT a solution.

It seems pointless to search through the entire history when I just need the future dates, so I'm looking for a way to glob ONLY the future dates. Maybe I can use readdir instead of glob in some smart way?

Below is a simplified version of my code. I apologize for any mistakes !!

###################################################
## folders + dated subfolders There are many !!
## A/<date1> = past
## A/<date2> = future
## B/<date3> = past
## B/<date4> = future
## ..
## function needs to return [ <date2>,A and <date4>,B ]
###################################################
sub _isGoodFolder {
    my ($theDate, $datePlusFolder, $attr) = @_;
    my ($date) = ($datePlusFolder =~ /\/(\d{8})$/);
    if ($date > $theDate) {    ## compare folder date with today
        return $datePlusFolder;
    }
    return '';
}

my $today = new Date::Business->image;
my $dir   = <basePath>;
my $cwd   = Cwd::getCwd();
chdir($dir);
my @folderKeys = grep { _isGoodFolder($today, $_) } glob("*/*");
chdir($cwd);
return [ map { join ",", (split "\/")[1,0] } @folderKeys ];

Replies are listed 'Best First'.
Re: smart glob of dated subfolders
by Corion (Patriarch) on Feb 22, 2023 at 20:18 UTC

    Both glob and readdir will end up calling the same underlying C function, so the only way to actually avoid reading a large directory would be to introduce a processed/ folder into which you move all folders that have already been processed.
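
    A minimal sketch of that idea (not Corion's code; the base path, the cutoff date, and the archive layout are assumptions, and rename only works within one filesystem):

        use strict;
        use warnings;

        # Hypothetical layout: $base/A/20230101, $base/B/20230301, ...
        # Folders whose date has passed get moved under $base/processed/<name>/,
        # so the live directories stay small and glob("*/*") only sees recent entries.
        my $base  = '/path/to/base';    # assumption: adjust to your tree
        my $today = '20230222';         # assumption: however you obtain "today"

        mkdir "$base/processed";
        for my $dir ( glob "$base/*" ) {
            next if $dir =~ m{/processed$};            # skip the archive itself
            my ($name) = $dir =~ m{([^/]+)$};
            mkdir "$base/processed/$name";
            for my $sub ( glob "$dir/*" ) {
                my ($date) = $sub =~ m{/(\d{8})$} or next;
                next if $date gt $today;               # keep future folders in place
                rename $sub, "$base/processed/$name/$date"
                    or warn "could not move $sub: $!";
            }
        }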

      Both glob and readdir will end up calling the same underlying C function
      use strict;
      use warnings;
      use Cwd;
      use Benchmark;

      my $dir = 'c:/windows';
      my ( @a1, @a2 );

      timethese 1, {
          glob => sub {
              my $cwd = getcwd;
              chdir $dir;
              @a1 = glob '*/*';
              chdir $cwd;
          },
          read => sub {
              my $cwd = getcwd;
              chdir $dir;
              opendir my $h, '.' or die;
              my @a = grep { $_ ne '.' and $_ ne '..' and -d $_ } readdir $h;
              for my $d ( @a ) {
                  opendir my $hh, $d or next;
                  push @a2, map "$d/$_",
                      grep { $_ ne '.' and $_ ne '..' } readdir $hh;
              }
              chdir $cwd;
          }
      };

      use Test::More;
      is $#a1, $#a2, 'array lengths are equal'
          or do {
              use Test::Differences;
              eq_or_diff [ sort @a1 ], [ sort @a2 ],
                  'look deeper', { context => 0 };
          };
      done_testing;

      I don't care much about the 7 (out of ~27e3) entries missing in one case (something to do with a leading dot in a name), but I wonder if an orders-of-magnitude speed difference is what the OP is observing for his large tree. My Perl is the latest Strawberry, plus fast NVMe storage.

      Benchmark: timing 1 iterations of glob, read...
            glob:  4 wallclock secs ( 0.30 usr +  3.34 sys =  3.64 CPU) @  0.27/s (n=1)
                  (warning: too few iterations for a reliable count)
            read:  0 wallclock secs ( 0.03 usr +  0.06 sys =  0.09 CPU) @ 10.53/s (n=1)
                  (warning: too few iterations for a reliable count)
      not ok 1 - array lengths are equal
      #   Failed test 'array lengths are equal'
      #   at glob.pl line 39.
      #          got: '27633'
      #     expected: '27640'
      not ok 2 - look deeper
      #   Failed test 'look deeper'
      #   at glob.pl line 41.
      # +----+-----+----+-------------------------------------------+
      # | Elt|Got  | Elt|Expected                                   |
      # +----+-----+----+-------------------------------------------+
      # |    |    * 656| 'INF/.NET CLR Data',                       *
      # |    |    * 657| 'INF/.NET CLR Networking',                 *
      # |    |    * 658| 'INF/.NET CLR Networking 4.0.0.0',         *
      # |    |    * 659| 'INF/.NET Data Provider for Oracle',       *
      # |    |    * 660| 'INF/.NET Data Provider for SqlServer',    *
      # |    |    * 661| 'INF/.NET Memory Cache 4.0',               *
      # |    |    * 662| 'INF/.NETFramework',                       *
      # +----+-----+----+-------------------------------------------+
      1..2
      # Looks like you failed 2 tests of 2.
Re: smart glob of dated subfolders
by LanX (Saint) on Feb 22, 2023 at 20:28 UTC
    Add another tier of directories for the year (YYYY), or additionally also the month (MM), and reorganize your files accordingly (a sketch follows below).

    Globbing first for the relevant folders, and only then parsing the files in those pre-filtered places, will keep your overhead relatively constant.

    Of course nothing will beat an in-memory search or an SQL database after synchronizing all file names.
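
    A rough sketch of the year-tier idea, assuming the tree is reorganized as <name>/YYYY/YYYYMMDD (the paths and names are made up, not LanX's code):

        use strict;
        use warnings;
        use POSIX qw(strftime);

        # Assumed reorganized layout: $base/A/2023/20230301, $base/B/2024/20240101, ...
        my $base  = '/path/to/base';               # assumption: adjust to your tree
        my $today = strftime '%Y%m%d', localtime;
        my $year  = substr $today, 0, 4;

        my @future;
        for my $ydir ( glob "$base/*/[0-9][0-9][0-9][0-9]" ) {   # year tier, e.g. A/2023
            my ($y) = $ydir =~ m{/(\d{4})$};
            next if $y lt $year;                   # skip whole past years without descending
            for my $sub ( glob "$ydir/*" ) {
                my ($date) = $sub =~ m{/(\d{8})$} or next;
                push @future, $sub if $date gt $today;   # fixed-width dates: string compare is enough
            }
        }
        print "$_\n" for @future;

    This way the number of directory entries glob has to look at stays roughly proportional to a year or two of data instead of the whole history.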

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

Re: smart glob of dated subfolders
by cavac (Parson) on Feb 27, 2023 at 18:00 UTC

    It really depends on how up-to-date your information has to be.

    On my home storage (which has my accumulated digital junk of the past 25 or so years on slow, rotating disks), I run something like this periodically:

     find /media/large -type f > /media/large/filelist.txt
     find /media/large -type d > /media/large/dirlist.txt
    Edit: '-type d' for directories.

    Doing a "find" can take 30+ seconds, but

    cat filelist.txt | grep -i myfirstquad | grep jpg
    is a sub-second operation.
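
     If a periodically refreshed dirlist.txt like that is acceptable for the OP's use case, filtering it for future YYYYMMDD folders is likewise sub-second; a rough sketch (the path and date handling are assumptions):

         use strict;
         use warnings;
         use POSIX qw(strftime);

         my $today = strftime '%Y%m%d', localtime;

         # dirlist.txt as produced by the periodic "find ... -type d" above
         open my $fh, '<', '/media/large/dirlist.txt' or die $!;
         chomp( my @dirs = <$fh> );
         close $fh;

         my @future = grep { m{/(\d{8})$} and $1 gt $today } @dirs;
         print "$_\n" for @future;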

    PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP

      I'm sure you meant to type

      find /media/large -type d > /media/large/dirlist.txt
      Also, on Linux systems there's 'locate', so that
      locate -i myfirstquad | grep jpg
      should work as well. At least on my system, 'locate' gets refreshed once a day, or you can refresh it any time by running 'updatedb'.

        Thanks for catching the find typo.

        As for "locate": This works fine locally. But in my case, the storage is mounted over NFS on my workstation. Scanning the whole thing over the network is quite slow. So the storage itself (also running Linux) is making those 2 files and providing them in the NFS mount.

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Re: smart glob of dated subfolders
by Anonymous Monk on Feb 23, 2023 at 10:19 UTC

    Is @a = glob("*/*"); fast enough compared to the whole program? What's your typical $#a? Your code does a potentially huge amount of unnecessary string-to-number conversion and regex-engine work.
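
    For illustration (probably not the actual bottleneck): since fixed-width YYYYMMDD names sort the same way as strings and as numbers, the per-entry work can be cut down to one substr and one string comparison, assuming every subfolder name really is an 8-digit date:

        use strict;
        use warnings;

        my $today = '20230222';    # assumption: however $theDate is obtained

        # hypothetical replacement for the grep { _isGoodFolder(...) } pass
        my @folderKeys =
            grep { substr($_, -8) gt $today }    # string order == date order for YYYYMMDD
            glob '*/*';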

      Thanks for your response! Typical $#a is only ~1 to 10.
