glob() far slower than `dir`

halley has asked for the wisdom of the Perl Monks concerning the following question:

I had a co-worker come to me with a surprising complaint that glob() was taking inordinate amounts of time, compared to a command-line listing of the same directory. I figured they should be directly the same as a readdir loop, but my tests show it's not.

In the distant past, somebody thought it would be a good idea to collect a number of data files in a single directory. Hundreds of thousands of data files in one directory. And the FORTRAN code is hardwired to that path. And this FORTRAN code was written in the 70s. And this FORTRAN code is not something anyone has authority, patience, budget, time or Congressional authority to go changing now.

This turns out to be a real nice torture test for certain perl scripts. To coin a catchy phrase, A big n case in a big O world.

A naive Windows command-line example. The ABC* pattern should select 540 files in this directory which contains over 210,000 filenames.

X:\database> perl -le "@foo = `dir /b ABC*`; print scalar @foo"
540
[download]

X:\database> perl -le "@foo = glob('ABC*'); print scalar @foo"
540
[download]

I ran both cases a number of times, to avoid any memory caching issues. The backtick version (even with the extra process overhead) ran in a couple of seconds, but the glob version took about 45 seconds to scan this massive directory of names.

I don't have the Perl 5.6.1 source code here, but it would seem that glob() must be collecting the whole file list in memory before weeding out the non-matches, or it must be naively starting at the top of the directory list on each iteration.

So, digging deeper, use Benchmark, I wrote three alternatives: opendir/readdir (rdglob), glob, and qx{dir /b}. (I added a DosGlob alternative, thanks to tye's suggestion below.) Turns out that glob appears between six and eight times slower than readdir, even for a smaller directory.


use strict;
use warnings;
use Benchmark;

use Cwd;

my $path = '\\\\big\unc\path\name';
my $wild = 'ABC*';

chdir($path);
my $cwd = cwd;
print $cwd, $/;

my $results =
timethese( 10,
       {
         'plglob' => sub { my @files = plglob($wild); die if not @file
+s },
         'rdglob' => sub { my @files = rdglob($wild); die if not @file
+s },
         'qxglob' => sub { my @files = qxglob($wild); die if not @file
+s },
         'fdglob' => sub { my @files = fdglob($wild); die if not @file
+s }
       } );

Benchmark::cmpthese( $results );

sub plglob
{
    my $wild = shift;
    return glob($wild);
}

sub rdglob
{
    my $wild = shift;
    opendir(DIR, '.') or return ();
    my @files;
    $wild = wildcard($wild);
    while (defined($_ = readdir(DIR)))
    {
    push(@files, $_) if m{$wild};
    }
    closedir(DIR);
    return @files;
}

sub qxglob
{
    my $wild = shift;
    my @files = `dir /b`;
    chomp @files;
    $wild = wildcard($wild);
    @files = grep { m{$wild} } @files;
    return @files;
}

sub fdglob
{
    my $wild = shift;
    use File::DosGlob;
    my @files = DosGlob($wild);
    return @files;
}

# turn a file mask like "ABC*.txt" into a regex like qr{ABC[^/\\]*\.tx
+t}
# (yes, this is a really fragile bug-prone way to go!)
sub wildcard
{
    my $wild = shift || '*';
    $wild =~ s!\?![^\/\\]!g;
    $wild =~ s!\*![^\/\\]*!g;
    $wild =~ s!\\!\\\\!g;
    $wild =~ s!\.!\\.!g;
    $wild = qr{\A$wild\z};
    return $wild;
}
[download]

Benchmark: timing 10 iterations of fdglob, plglob, qxglob, rdglob...
              Rate qxglob fdglob rdglob plglob
qxglob 1.27e-002/s     --  -100%  -100%  -100%
fdglob      3.44/s 26995%     --   -35%   -40%
rdglob      5.33/s 41871%    55%     --    -7%
plglob      5.71/s 44893%    66%     7%     --
[download]

{I'll post the code and its up-to-date in a few minutes.}

I'm not sure if I'm doing something wrong here. Wait, I am sure I'm doing something wrong, but I don't know WHAT.

--
[ e d @ h a l l e y . c c ]

Comment on glob() far slower than `dir` Select or Download Code

Replies are listed 'Best First'.
Re: glob() far slower than `dir` (stat--) by tye (Sage) on Feb 18, 2005 at 18:56 UTC
glob on Win32 can be orders of magnitude slower than it should be. The C code (from bash and/or BSD?) used to implement the default glob does a stat on every single file. Use a different glob module to work around this problem. For example, just adding "-MFile::DosGlob=glob" should make a big difference. - tye	[reply]
Re^2: glob() far slower than `dir` (stat--) by halley (Prior) on Feb 18, 2005 at 20:00 UTC
Thanks, tye, I've added a dosglob test case to the benchmark. I found that our network performance varies widely, and dosglob and readdir were roughly comparable on most runs. Either one would be preferable to using backticks, of course. I didn't realize that DosGlob's version was so different, performance-wise. I figured it was mostly for the extended wildcard support and case-sensitivity issues. The scripts we write must be portable, but since it's available and works on our older Solaris setups, we can use that here, too. -- `[ e d @ h a l l e y . c c ]`	[reply]
Re: glob() far slower than `dir` by NiJo (Friar) on Feb 18, 2005 at 22:07 UTC
Your readdir example can be improved by using list context and less system calls: `sub read_dir_list{ opendir(DIR, '.'); my @all_files = readdir DIR; closedir DIR; my @files = grep /^ABC.*/ @all_files; return @files; }` [download] If you need to work with all the files condider sorting the file list by inodes. You'll use disk readahead instead of cache and disk seeks are minimized.	[reply] [d/l]
Re^2: glob() far slower than `dir` by halley (Prior) on Feb 21, 2005 at 15:26 UTC
The directory in question had 210,000 filenames in it. I was looking for about 500 names in that set. I realize that I could grep through a full set at once, but at the expense of about two~four megabytes of transient data being collected into memory. You're right, though; for benchmarking, I should consider adding that case. -- `[ e d @ h a l l e y . c c ]`	[reply]
Re: glob() far slower than `dir` by dragonchild (Archbishop) on Feb 18, 2005 at 18:34 UTC
While I don't have any direct impact on your problem, there are several other solutions to look at. IO::Dir - you might be able to use the tie interface and hook it up to a hash pre-tied to Tie::Regex::Hash. You could symlink the files to a real directory structure, maybe based upon the one CPAN uses. For example, IO::Dir is located at G/GB/GBARR/IO-Dir-x.xx.tar.gz. That way, the FORTRAN dreck and real code can both play nicely together. Since you're on Win32, there's bound to be a Win32 module that provides access to a system call that does what you need. Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.	[reply]