halley has asked for the wisdom of the Perl Monks concerning the following question:
In the distant past, somebody thought it would be a good idea to collect a number of data files in a single directory. Hundreds of thousands of data files in one directory. And the FORTRAN code is hardwired to that path. And this FORTRAN code was written in the 70s. And this FORTRAN code is not something anyone has authority, patience, budget, time or Congressional authority to go changing now.
This turns out to be a real nice torture test for certain perl scripts. To coin a catchy phrase, A big n case in a big O world.
A naive Windows command-line example. The ABC* pattern should select 540 files in this directory which contains over 210,000 filenames.
X:\database> perl -le "@foo = `dir /b ABC*`; print scalar @foo" 540
X:\database> perl -le "@foo = glob('ABC*'); print scalar @foo" 540
I ran both cases a number of times, to avoid any memory caching issues. The backtick version (even with the extra process overhead) ran in a couple of seconds, but the glob version took about 45 seconds to scan this massive directory of names.
I don't have the Perl 5.6.1 source code here, but it would seem that glob() must be collecting the whole file list in memory before weeding out the non-matches, or it must be naively starting at the top of the directory list on each iteration.
So, digging deeper, use Benchmark, I wrote three alternatives: opendir/readdir (rdglob), glob, and qx{dir /b}. (I added a DosGlob alternative, thanks to tye's suggestion below.) Turns out that glob appears between six and eight times slower than readdir, even for a smaller directory.
use strict; use warnings; use Benchmark; use Cwd; my $path = '\\\\big\unc\path\name'; my $wild = 'ABC*'; chdir($path); my $cwd = cwd; print $cwd, $/; my $results = timethese( 10, { 'plglob' => sub { my @files = plglob($wild); die if not @file +s }, 'rdglob' => sub { my @files = rdglob($wild); die if not @file +s }, 'qxglob' => sub { my @files = qxglob($wild); die if not @file +s }, 'fdglob' => sub { my @files = fdglob($wild); die if not @file +s } } ); Benchmark::cmpthese( $results ); sub plglob { my $wild = shift; return glob($wild); } sub rdglob { my $wild = shift; opendir(DIR, '.') or return (); my @files; $wild = wildcard($wild); while (defined($_ = readdir(DIR))) { push(@files, $_) if m{$wild}; } closedir(DIR); return @files; } sub qxglob { my $wild = shift; my @files = `dir /b`; chomp @files; $wild = wildcard($wild); @files = grep { m{$wild} } @files; return @files; } sub fdglob { my $wild = shift; use File::DosGlob; my @files = DosGlob($wild); return @files; } # turn a file mask like "ABC*.txt" into a regex like qr{ABC[^/\\]*\.tx +t} # (yes, this is a really fragile bug-prone way to go!) sub wildcard { my $wild = shift || '*'; $wild =~ s!\?![^\/\\]!g; $wild =~ s!\*![^\/\\]*!g; $wild =~ s!\\!\\\\!g; $wild =~ s!\.!\\.!g; $wild = qr{\A$wild\z}; return $wild; }
Benchmark: timing 10 iterations of fdglob, plglob, qxglob, rdglob... Rate qxglob fdglob rdglob plglob qxglob 1.27e-002/s -- -100% -100% -100% fdglob 3.44/s 26995% -- -35% -40% rdglob 5.33/s 41871% 55% -- -7% plglob 5.71/s 44893% 66% 7% --
{I'll post the code and its up-to-date in a few minutes.}
I'm not sure if I'm doing something wrong here. Wait, I am sure I'm doing something wrong, but I don't know WHAT.
--
[ e d @ h a l l e y . c c ]
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: glob() far slower than `dir` (stat--)
by tye (Sage) on Feb 18, 2005 at 18:56 UTC | |
by halley (Prior) on Feb 18, 2005 at 20:00 UTC | |
|
Re: glob() far slower than `dir`
by NiJo (Friar) on Feb 18, 2005 at 22:07 UTC | |
by halley (Prior) on Feb 21, 2005 at 15:26 UTC | |
|
Re: glob() far slower than `dir`
by dragonchild (Archbishop) on Feb 18, 2005 at 18:34 UTC |