comment on

I had a co-worker come to me with a surprising complaint that glob() was taking inordinate amounts of time, compared to a command-line listing of the same directory. I figured they should be directly the same as a readdir loop, but my tests show it's not.

In the distant past, somebody thought it would be a good idea to collect a number of data files in a single directory. Hundreds of thousands of data files in one directory. And the FORTRAN code is hardwired to that path. And this FORTRAN code was written in the 70s. And this FORTRAN code is not something anyone has authority, patience, budget, time or Congressional authority to go changing now.

This turns out to be a real nice torture test for certain perl scripts. To coin a catchy phrase, A big n case in a big O world.

A naive Windows command-line example. The ABC* pattern should select 540 files in this directory which contains over 210,000 filenames.

X:\database> perl -le "@foo = `dir /b ABC*`; print scalar @foo"
540
[download]

X:\database> perl -le "@foo = glob('ABC*'); print scalar @foo"
540
[download]

I ran both cases a number of times, to avoid any memory caching issues. The backtick version (even with the extra process overhead) ran in a couple of seconds, but the glob version took about 45 seconds to scan this massive directory of names.

I don't have the Perl 5.6.1 source code here, but it would seem that glob() must be collecting the whole file list in memory before weeding out the non-matches, or it must be naively starting at the top of the directory list on each iteration.

So, digging deeper, use Benchmark, I wrote three alternatives: opendir/readdir (rdglob), glob, and qx{dir /b}. (I added a DosGlob alternative, thanks to tye's suggestion below.) Turns out that glob appears between six and eight times slower than readdir, even for a smaller directory.


use strict;
use warnings;
use Benchmark;

use Cwd;

my $path = '\\\\big\unc\path\name';
my $wild = 'ABC*';

chdir($path);
my $cwd = cwd;
print $cwd, $/;

my $results =
timethese( 10,
       {
         'plglob' => sub { my @files = plglob($wild); die if not @file
+s },
         'rdglob' => sub { my @files = rdglob($wild); die if not @file
+s },
         'qxglob' => sub { my @files = qxglob($wild); die if not @file
+s },
         'fdglob' => sub { my @files = fdglob($wild); die if not @file
+s }
       } );

Benchmark::cmpthese( $results );

sub plglob
{
    my $wild = shift;
    return glob($wild);
}

sub rdglob
{
    my $wild = shift;
    opendir(DIR, '.') or return ();
    my @files;
    $wild = wildcard($wild);
    while (defined($_ = readdir(DIR)))
    {
    push(@files, $_) if m{$wild};
    }
    closedir(DIR);
    return @files;
}

sub qxglob
{
    my $wild = shift;
    my @files = `dir /b`;
    chomp @files;
    $wild = wildcard($wild);
    @files = grep { m{$wild} } @files;
    return @files;
}

sub fdglob
{
    my $wild = shift;
    use File::DosGlob;
    my @files = DosGlob($wild);
    return @files;
}

# turn a file mask like "ABC*.txt" into a regex like qr{ABC[^/\\]*\.tx
+t}
# (yes, this is a really fragile bug-prone way to go!)
sub wildcard
{
    my $wild = shift || '*';
    $wild =~ s!\?![^\/\\]!g;
    $wild =~ s!\*![^\/\\]*!g;
    $wild =~ s!\\!\\\\!g;
    $wild =~ s!\.!\\.!g;
    $wild = qr{\A$wild\z};
    return $wild;
}
[download]

Benchmark: timing 10 iterations of fdglob, plglob, qxglob, rdglob...
              Rate qxglob fdglob rdglob plglob
qxglob 1.27e-002/s     --  -100%  -100%  -100%
fdglob      3.44/s 26995%     --   -35%   -40%
rdglob      5.33/s 41871%    55%     --    -7%
plglob      5.71/s 44893%    66%     7%     --
[download]

{I'll post the code and its up-to-date in a few minutes.}

I'm not sure if I'm doing something wrong here. Wait, I am sure I'm doing something wrong, but I don't know WHAT.

--
[ e d @ h a l l e y . c c ]

In reply to glob() far slower than `dir` by halley

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.