markkneen has asked for the wisdom of the Perl Monks concerning the following question:

I want to get a list of files from a very large directory (around 3.7 million files), using something like this:

opendir(DIR, $some_path);
my @list = grep(!/^(\.+?)$/, readdir(DIR));

The problem is that this returns a very large array and takes ages to run.

What I would like to do is limit the results to a certain date: for example, return only files for 03/04/2004, and have it run reasonably quickly.

Is there a way to achieve this?
Any assistance is most appreciated.
Regards Mark

Replies are listed 'Best First'.
Re: using grep on a directory to list files for a single date
by zejames (Hermit) on Dec 01, 2004 at 13:43 UTC

    Just for fun, I wanted to measure the speed difference between grep and a plain while loop.

    So I created lots of small files in a test directory:

    $dir = "test";
    mkdir $dir or die "Unable to create dir : $!"
        if not -d $dir;
    chdir $dir;
    foreach ( 'aaa' .. 'zzz' ) {
        open F, "> $_";
        my $data = chr( 97 + int rand 10 );
        print F $data;
        close F;
    }

    Then I tried to list each file of this directory, and compared:

    use Benchmark qw/cmpthese/;
    $dir = "test";
    cmpthese( 1000, {
        'grep' => sub {
            opendir DIR, $dir or die "Unable to open dir : $!\n";
            my @list = grep( !/^(\.+?)$/, readdir(DIR) );
            closedir DIR;
        },
        'while' => sub {
            opendir DIR, $dir or die "Unable to open dir : $!\n";
            my @list;
            while ( defined( my $file = readdir DIR ) ) {
                push @list, $file unless $file =~ /^(\.+?)$/;
            }
            closedir DIR;
        },
    } );

    As expected, the difference is huge:

    D:\Perl\bin>perl test2.pl
            Rate  grep  while
    grep  6.51/s    --  -100%
    while 2667/s 40833%    --

    D:\Perl\bin>

    Using grep, perl calls readdir in list context, so it builds and returns the whole list of files in the directory, which is huge.

    When using while, perl returns the file names one by one, which is much cheaper in memory.

    So, in your case: use while.
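    To make that concrete, here is a minimal sketch of the while-based approach applied to the original question. The sub name, the dd/mm/yyyy format, and the directory handling are my own additions for illustration, not from the post above:

    ```perl
    use strict;
    use warnings;
    use POSIX qw(strftime);

    # Sketch only: list_for_date() and the dd/mm/yyyy format are
    # assumptions, not part of the original post.
    sub list_for_date {
        my ( $dir, $want_date ) = @_;    # $want_date as dd/mm/yyyy
        opendir my $dh, $dir or die "Unable to open dir $dir: $!";
        my @matches;
        while ( my $name = readdir $dh ) {    # scalar context: one name at a time
            next if $name eq '.' or $name eq '..';
            my $mtime = ( stat "$dir/$name" )[9];
            next unless defined $mtime;
            push @matches, $name
                if strftime( '%d/%m/%Y', localtime $mtime ) eq $want_date;
        }
        closedir $dh;
        return sort @matches;
    }
    ```

    Only one file name is ever held in $name at a time, and @matches holds just the hits, so memory stays proportional to the result rather than to the 3.7 million entries.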

    For information, I was using Windows XP SP1 and ActivePerl 5.8.4 on a NTFS file system.

    HTH


    --
    zejames
      OK, I've sort of got something working, but I'm sure there is a more "efficient" way to do it, as it's still returning a large array and loads of the elements are empty?
      sub list {
          my $path = shift;
          my $comp = shift;
          if ( !-e $path ) { die "Error : $path $!\n"; }
          opendir( DIR, $path ) or die "Error : $path $!\n";
          return sort map {
              my ( $d, $m, $y ) = ( localtime( ( stat "$path/$_" )[9] ) )[ 3 .. 5 ];
              $m += 1;
              $y += 1900;
              $m = ( $m < 10 ) ? "0$m" : $m;
              $d = ( $d < 10 ) ? "0$d" : $d;
              my $date = "$d/$m/$y";
              if ( $date eq $comp ) { "$_\n" };
          } grep( !/^(\.+?)$/, readdir(DIR) );
      }
      Any ideas?
      Thanks for your help on this so far.
      (Going to try the while() loop next.)

        What is the if in the map trying to do?
        if $date eq $comp is false, map adds an undef to the returned list.
        if $date eq $comp is true, map returns "$_\n".
        Below, I assume that you were trying to filter out dates that don't match. Filtering is grep's job, not map's. The "empty" elements you're getting are the undef returned by map when $date eq $comp is false.

        $! doesn't have any meaningful value after calling -e.

        The -e is redundant. opendir will fail if the dir doesn't exist, and you already handle that.

        The capture in /^(\.+?)$/ wastes time. The ? is meaningless. I wonder if $_ eq '.' || $_ eq '..' would be faster.
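        That hunch is easy to check with a small Benchmark sketch comparing the two filters (the sample file list below is made up for the comparison):

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        # Made-up sample listing: the two dot entries plus 100 ordinary names.
        my @names = ( '.', '..', map { "file$_" } 1 .. 100 );

        cmpthese( 50_000, {
            regex => sub { my @l = grep { !/^\.+$/ } @names },
            eq    => sub { my @l = grep { $_ ne '.' && $_ ne '..' } @names },
        } );
        ```

        Both filters keep exactly the non-dot entries; cmpthese prints their relative rates.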

        It's probably faster to split $comp into $year, $month and $day than to convert all the mtimes to strings.

        sub list {
            my ( $path, $comp ) = @_;
            $comp =~ m#^(..)/(..)/(....)$#
                or die("Error: Badly formatted \$comp.\n");
            my $comp_d = $1;
            my $comp_m = $2;
            my $comp_y = $3;
            local *DIR;
            opendir( DIR, $path )
                or die("Error: Unable to open directory $path: $!\n");
            my @filtered_listing;
            while (<DIR>) {
                next if /^\.+$/;
                my ( $mtime_d, $mtime_m, $mtime_y )
                    = ( localtime( ( stat "$path/$_" )[9] ) )[ 3 .. 5 ];
                $mtime_m += 1;       # localtime months are 0-based
                $mtime_y += 1900;    # and years are offsets from 1900
                next unless $mtime_d == $comp_d
                    && $mtime_m == $comp_m
                    && $mtime_y == $comp_y;
                push( @filtered_listing, $_ );
            }
            return sort @filtered_listing;
        }

        ikegami has already posted an excellent reply, showing exactly how to do it with while. That is probably the best way to solve this particular problem, but I thought I would show you how to use map to filter out elements, for your future reference:

        my @array = qw(foo bar baz qux);
        my @newarray = map {
            my $foo = $_;
            $foo =~ s/./\u$&/;    # useless example
            $foo =~ /Ba/ ? $foo : ();
        } grep { /a/ } @array;

        The key here is to return an empty list when the condition fails. It's neat that we can do this with map, but it's usually better to use another grep:

        my @array = qw(foo bar baz qux);
        my @newarray = grep { /Ba/ } map {
            my $foo = $_;
            $foo =~ s/./\u$&/;    # useless example
            $foo;
        } grep { /a/ } @array;

        HTH

Re: using grep on a directory to list files for a single date
by fglock (Vicar) on Dec 01, 2004 at 13:12 UTC

    I think that parsing the output of a shell command is the fastest you can get:

    ls -R  --time=ctime --time-style="+%Y-%m-%d" -g -o

    This gives a text like:

    ./htdocs/gui/img/control/default/cs-iso:
    total 160
    -rw-r-Sr--  1  1426 2004-10-28 cs-iso_abschic.gif
    -rw-r-Sr--  1  1778 2004-10-28 cs-iso_admission_data.gif
    -rw-r-Sr--  1  1479 2004-10-28 cs-iso_admit-blue.gif
    ...
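    A rough sketch of parsing such a listing in Perl; the field layout is inferred from the sample output above, and the target date, variable names, and captured-listing setup are my assumptions:

    ```perl
    use strict;
    use warnings;

    # Assumed captured output of the ls command above (sample lines only).
    my $listing = <<'END';
    ./htdocs/gui/img/control/default/cs-iso:
    total 160
    -rw-r-Sr-- 1 1426 2004-10-28 cs-iso_abschic.gif
    -rw-r-Sr-- 1 1778 2004-10-28 cs-iso_admission_data.gif
    END
    $listing =~ s/^[ ]{4}//mg;    # strip the indentation of this example

    my $want = '2004-10-28';      # assumed target date
    my @files;
    my $current_dir = '.';

    # In real use, read straight from the command instead:
    #   open my $ls, '-|', 'ls -R --time=ctime --time-style="+%Y-%m-%d" -g -o' or die $!;
    open my $ls, '<', \$listing or die $!;
    while ( my $line = <$ls> ) {
        chomp $line;
        if ( $line =~ m{^(.*):$} ) { $current_dir = $1; next; }    # directory header
        next if $line =~ /^total\s/ or $line !~ /\S/;
        # fields: perms, links, size, date, name
        my ( $date, $name ) = ( split ' ', $line, 5 )[ 3, 4 ];
        push @files, "$current_dir/$name" if defined $date and $date eq $want;
    }
    print "$_\n" for @files;
    ```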
Re: using grep on a directory to list files for a single date
by Jaap (Curate) on Dec 01, 2004 at 12:53 UTC
    You could use the -M operator to check the last modification date. Make sure you don't load the whole array into memory first.
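    -M gives a file's age in days relative to script start time ($^T), so a single calendar date corresponds to a range of ages. A sketch with an assumed helper sub and illustrative cutoffs:

    ```perl
    use strict;
    use warnings;

    # Sketch only: files_aged_between() is an assumed helper, and the
    # day window is illustrative.  -M is measured from script start
    # time ($^T), in fractional days.
    sub files_aged_between {
        my ( $dir, $min_days, $max_days ) = @_;
        opendir my $dh, $dir or die "Unable to open dir $dir: $!";
        my @hits;
        while ( my $name = readdir $dh ) {   # one entry at a time, no big array
            next if $name eq '.' or $name eq '..';
            my $age = -M "$dir/$name";       # age in days
            push @hits, $name
                if defined $age && $age >= $min_days && $age < $max_days;
        }
        closedir $dh;
        return sort @hits;
    }

    # e.g. files modified between 3 and 4 days ago:
    # my @list = files_aged_between( $some_path, 3, 4 );
    ```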
      That's the sort of idea I had in mind, but I'm not sure how. I know I could use map but I don't have a clue how.
      Thanks...
        If you don't know how to map/grep, just use a while loop. They tend to be less obfuscated.
Re: using grep on a directory to list files for a single date
by NiJo (Friar) on Dec 01, 2004 at 19:49 UTC
    The key to a solution is reducing disk seeks. I don't know about NTFS, but on Unix there is the 'directory file' hierarchy: name and inode number -> inode (with most of the stat() info) -> file sectors. NTFS should have something similar.

    If you manage to get all of your information from the 'directory file' only, this can be done in one big read. E.g., you can use an existing naming scheme of the files.

    The other thing is to play OS. If you need to read the (still some 1000?) filtered files, I'd stat them first. This requires an inode read. But before processing the files one by one, sort them by inode number. You help the OS reduce disk seeks and make use of read-ahead caches. In one of my toy programs, sorting dramatically changed the disk sound from noisy screeching to a quiet tok-tok-tok. And it was a lot faster.
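    A sketch of the inode-sorting idea; the sub name is my own. On Unix the inode number is field 1 of stat(), while on Win32/NTFS it may be 0 for every file, in which case the sort is a harmless no-op:

    ```perl
    use strict;
    use warnings;

    # Sketch: sort file names by inode number before reading them, so
    # the disk is visited in roughly physical order.
    sub sort_by_inode {
        my ( $dir, @names ) = @_;
        my %inode;
        for my $name (@names) {
            my @st = stat "$dir/$name";
            $inode{$name} = @st ? $st[1] : 0;    # field 1 = inode number
        }
        return sort { $inode{$a} <=> $inode{$b} } @names;
    }
    ```

    The stat results could be cached here as well, so the later per-file processing doesn't have to stat again.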

      Thanks - you have all been fantastic. Really grateful for everybody's assistance.

      I had a bit of a search around for info about the NTFS file table (Master File Table) but found little in the way of accessing it through perl. There is Win32::AdminMisc, which sounds like it might be up to the job, but I'm still reading up on it :(

      Had to make a few changes to the script to get it to run:
      while (my $file = readdir(DIR)) {
      instead of:
      while (<DIR>) { <- this didn't work (well, not for me).

      Once again - Thanks...
      Mark K

        Angle brackets <...> can only be used with file handles, not directory handles.


        --
        zejames
The simpler way
by Luca Benini (Scribe) on Dec 02, 2004 at 12:41 UTC
    To find files modified exactly 3 days ago:
    find2perl -mtime 3 > a.pl
    To find files modified within the last 3 days:
    find2perl -mtime -3 > b.pl
    ...and then adapt the resulting script.

    See: man (find|find2perl)
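    The scripts find2perl generates are built on File::Find, so a hand-written equivalent of the -mtime -3 case might look like this (the wrapper sub is a sketch of mine, not find2perl's actual output):

    ```perl
    use strict;
    use warnings;
    use File::Find;

    # Sketch only: modified_within_days() is my wrapper.
    # -M _ reuses the stat buffer already filled by -f.
    sub modified_within_days {
        my ( $dir, $days ) = @_;
        my @hits;
        find(
            sub {
                push @hits, $File::Find::name
                    if -f $_ && -M _ < $days;    # modified in the last $days days
            },
            $dir
        );
        return sort @hits;
    }

    # e.g. my @recent = modified_within_days( '.', 3 );
    ```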