PerlMonks
smart glob of dated subfolders

by Anonymous Monk
on Feb 22, 2023 at 20:09 UTC [id://11150538]

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have directories A, B, etc. that contain dated subfolders in chronological order (YYYYMMDD format), and I am using glob to find ONLY the subfolders in the future. I cannot assume anything about the dates (i.e. how far into the past or future they reach). This function, however, has become quite slow over time, so I am looking for a way to speed it up. I call this function quite often, so I'd like to make it as fast as possible. I need to keep the history of dated folders, so deleting subfolders is NOT a solution.

It seems pointless to search through the entire history when I just need the future dates, so I'm looking for a way to glob ONLY the future dates. Maybe I can use readdir instead of glob in some smart way?

Below is a simplified version of my code. I apologize for any mistakes !!

###################################################
## folders + dated subfolders There are many !!
## A/<date1> = past
## A/<date2> = future
## B/<date3> = past
## B/<date4> = future
## ..
## function needs to return [ <date2>,A and <date4>,B ]
###################################################
sub _isGoodFolder {
    my ($theDate, $datePlusFolder, $attr) = @_;
    my ($date) = ($datePlusFolder =~ /\/(\d{8})$/);
    if ($date > $theDate) {    ## compare folder date with today
        return $datePlusFolder;
    }
    return '';
}

my $today = new Date::Business->image;
my $dir   = <basePath>;
my $cwd   = Cwd::getCwd();
chdir($dir);
my @folderKeys = grep { _isGoodFolder($today, $_) } glob("*/*");
chdir($cwd);
return [ map { join ",", (split "\/")[1,0] } @folderKeys ];

Replies are listed 'Best First'.
Re: smart glob of dated subfolders
by Corion (Patriarch) on Feb 22, 2023 at 20:18 UTC

    Both glob and readdir will end up calling the same underlying C function, so the only way to actually avoid reading a large directory would be to introduce a processed/ folder into which you move all folders that have already been processed.
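
    A minimal sketch of that idea (not Corion's code; the base path, the cutoff date, and the archive layout are assumptions, and rename only works within one filesystem):

        use strict;
        use warnings;

        # Hypothetical layout: $base/A/20230101, $base/B/20230301, ...
        # Folders whose date has passed get moved under $base/processed/<name>/,
        # so the live directories stay small and glob("*/*") only sees recent entries.
        my $base  = '/path/to/base';    # assumption: adjust to your tree
        my $today = '20230222';         # assumption: however you obtain "today"

        mkdir "$base/processed";
        for my $dir ( glob "$base/*" ) {
            next if $dir =~ m{/processed$};            # skip the archive itself
            my ($name) = $dir =~ m{([^/]+)$};
            mkdir "$base/processed/$name";
            for my $sub ( glob "$dir/*" ) {
                my ($date) = $sub =~ m{/(\d{8})$} or next;
                next if $date gt $today;               # keep future folders in place
                rename $sub, "$base/processed/$name/$date"
                    or warn "could not move $sub: $!";
            }
        }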

      Both glob and readdir will end up calling the same underlying C function
      use strict;
      use warnings;
      use Cwd;
      use Benchmark;

      my $dir = 'c:/windows';
      my ( @a1, @a2 );

      timethese 1, {
          glob => sub {
              my $cwd = getcwd;
              chdir $dir;
              @a1 = glob '*/*';
              chdir $cwd;
          },
          read => sub {
              my $cwd = getcwd;
              chdir $dir;
              opendir my $h, '.' or die;
              my @a = grep { $_ ne '.' and $_ ne '..' and -d $_ } readdir $h;
              for my $d ( @a ) {
                  opendir my $hh, $d or next;
                  push @a2, map "$d/$_",
                      grep { $_ ne '.' and $_ ne '..' } readdir $hh;
              }
              chdir $cwd;
          }
      };

      use Test::More;
      is $#a1, $#a2, 'array lengths are equal'
          or do {
              use Test::Differences;
              eq_or_diff [ sort @a1 ], [ sort @a2 ],
                  'look deeper', { context => 0 };
          };
      done_testing;

      I don't care much about the 7 (out of ~27e3) entries missing in one case (something to do with a leading dot in a name), but I wonder if an orders-of-magnitude speed difference is what the OP is observing for his large tree. My Perl is the latest Strawberry, plus fast NVMe storage.

      Benchmark: timing 1 iterations of glob, read...
            glob:  4 wallclock secs ( 0.30 usr +  3.34 sys =  3.64 CPU) @  0.27/s (n=1)
                  (warning: too few iterations for a reliable count)
            read:  0 wallclock secs ( 0.03 usr +  0.06 sys =  0.09 CPU) @ 10.53/s (n=1)
                  (warning: too few iterations for a reliable count)
      not ok 1 - array lengths are equal
      #   Failed test 'array lengths are equal'
      #   at glob.pl line 39.
      #          got: '27633'
      #     expected: '27640'
      not ok 2 - look deeper
      #   Failed test 'look deeper'
      #   at glob.pl line 41.
      # +----+-----+----+-------------------------------------------+
      # | Elt|Got  | Elt|Expected                                   |
      # +----+-----+----+-------------------------------------------+
      # |    |    * 656| 'INF/.NET CLR Data',                       *
      # |    |    * 657| 'INF/.NET CLR Networking',                 *
      # |    |    * 658| 'INF/.NET CLR Networking 4.0.0.0',         *
      # |    |    * 659| 'INF/.NET Data Provider for Oracle',       *
      # |    |    * 660| 'INF/.NET Data Provider for SqlServer',    *
      # |    |    * 661| 'INF/.NET Memory Cache 4.0',               *
      # |    |    * 662| 'INF/.NETFramework',                       *
      # +----+-----+----+-------------------------------------------+
      1..2
      # Looks like you failed 2 tests of 2.
Re: smart glob of dated subfolders
by LanX (Saint) on Feb 22, 2023 at 20:28 UTC
    Add another tier of directories for the year (YYYY), or additionally also the month (MM), and reorganize your files accordingly (a sketch follows below).

    Globbing first for the relevant folders, and only then parsing the files in those pre-filtered places, will keep your overhead relatively constant.

    Of course nothing will beat an in-memory search or an SQL database after synchronizing all file names.
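
    A rough sketch of the year-tier idea, assuming the tree is reorganized as <name>/YYYY/YYYYMMDD (the paths and names are made up, not LanX's code):

        use strict;
        use warnings;
        use POSIX qw(strftime);

        # Assumed reorganized layout: $base/A/2023/20230301, $base/B/2024/20240101, ...
        my $base  = '/path/to/base';               # assumption: adjust to your tree
        my $today = strftime '%Y%m%d', localtime;
        my $year  = substr $today, 0, 4;

        my @future;
        for my $ydir ( glob "$base/*/[0-9][0-9][0-9][0-9]" ) {   # year tier, e.g. A/2023
            my ($y) = $ydir =~ m{/(\d{4})$};
            next if $y lt $year;                   # skip whole past years without descending
            for my $sub ( glob "$ydir/*" ) {
                my ($date) = $sub =~ m{/(\d{8})$} or next;
                push @future, $sub if $date gt $today;   # fixed-width dates: string compare is enough
            }
        }
        print "$_\n" for @future;

    This way the number of directory entries glob has to look at stays roughly proportional to a year or two of data instead of the whole history.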

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery

Re: smart glob of dated subfolders
by cavac (Parson) on Feb 27, 2023 at 18:00 UTC

    It really depends on how up-to-date your information has to be.

    On my home storage (which has my accumulated digital junk of the past 25 or so years on slow, rotating disks), I run something like this periodically:

     find /media/large -type f > /media/large/filelist.txt
     find /media/large -type d > /media/large/dirlist.txt
    Edit: '-type d' for directories.

    Doing a "find" can take 30+ seconds, but

    cat filelist.txt | grep -i myfirstquad | grep jpg
    is a sub-second operation.
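
     If a periodically refreshed dirlist.txt like that is acceptable for the OP's use case, filtering it for future YYYYMMDD folders is likewise sub-second; a rough sketch (the path and date handling are assumptions):

         use strict;
         use warnings;
         use POSIX qw(strftime);

         my $today = strftime '%Y%m%d', localtime;

         # dirlist.txt as produced by the periodic "find ... -type d" above
         open my $fh, '<', '/media/large/dirlist.txt' or die $!;
         chomp( my @dirs = <$fh> );
         close $fh;

         my @future = grep { m{/(\d{8})$} and $1 gt $today } @dirs;
         print "$_\n" for @future;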

    PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP

      I'm sure you meant to type

      find /media/large -type d > /media/large/dirlist.txt
      Also, on Linux systems there's 'locate', so that
      locate -i myfirstquad | grep jpg
      should work as well. At least on my system, 'locate' gets refreshed once a day, or you can refresh it any time by running 'updatedb'.

        Thanks for catching the find typo.

        As for "locate": This works fine locally. But in my case, the storage is mounted over NFS on my workstation. Scanning the whole thing over the network is quite slow. So the storage itself (also running Linux) is making those 2 files and providing them in the NFS mount.

        PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
Re: smart glob of dated subfolders
by Anonymous Monk on Feb 23, 2023 at 10:19 UTC

    Is @a = glob("*/*"); fast enough compared to the whole program? What's your typical $#a? Your code does a potentially huge amount of unnecessary string-to-number conversion and regex-engine work.
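
    For illustration (probably not the actual bottleneck): since fixed-width YYYYMMDD names sort the same way as strings and as numbers, the per-entry work can be cut down to one substr and one string comparison, assuming every subfolder name really is an 8-digit date:

        use strict;
        use warnings;

        my $today = '20230222';    # assumption: however $theDate is obtained

        # hypothetical replacement for the grep { _isGoodFolder(...) } pass
        my @folderKeys =
            grep { substr($_, -8) gt $today }    # string order == date order for YYYYMMDD
            glob '*/*';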

      Thanks for your response! Typical $#a is only ~1 to 10.
