harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

I want to write a simple Perl program that goes through the current folder recursively and looks at each item; if it's a file, it prints its contents and moves on to the next one.

I am a beginner Perl programmer, and I don't know why this is happening, but this perl program stops responding, and then I have to kill it every time.

I am using TinyPerl 5.8 under Windows 7.

I am testing this program in a folder that has about 50 text files and no folders. When I launch my Perl program, it seems to work perfectly fine, but it doesn't return to the command prompt. It just hangs. The cursor stops blinking, and I have to kill the process.

If I only read the first 100 bytes of each file, there's no issue: I am back at the command prompt immediately. If I read the first 2000 bytes, there's a 3-second delay. And when I tried reading each entire file and printing its contents, the script stopped responding and I had to kill it.

use strict;
use warnings;

my $PATH = '.';
my $CONTENT;

explore($PATH);

sub explore {
    my $PATH = shift;
    my $FILE;
    my $SUB;
    opendir(my $DIR, $PATH) or return;
    while (my $SUB = readdir $DIR) {
        next if $SUB eq '.' or $SUB eq '..';
        $SUB = "$PATH/$SUB";

        # If it's a folder, explore it.
        # If it's a file, print it.
        if (-d $SUB) { explore($SUB); next; }
        if (-f $SUB) {
            open($FILE, '<:raw', $SUB) or next;
            read($FILE, $CONTENT, 2000) or next;
            close($FILE);
            print $CONTENT;
            $CONTENT = ''; # don't need this data anymore
        }
    }
    close $DIR;
}

Replies are listed 'Best First'.
Re: print all files is soo slow! Why?
by haukex (Archbishop) on Jul 26, 2016 at 10:35 UTC

    Hi harangzsolt33,

    The problem you're experiencing does sound strange to me, especially with the behavior you describe here, but it's been a while since I worked with Perl on the command line in Windows - the explanations by the AM (here) seem plausible if you're trying to print lots of binary data.

    I just wanted to point out that Perl has an operator that tries to tell "text" from "binary" files based on a heuristic: -T (the heuristic is described in the documentation).
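
    For instance, a minimal sketch of filtering on that heuristic (the output label is made up, and the loop just looks at the current directory):

    use strict;
    use warnings;

    for my $file (glob '*') {
        next unless -f $file;   # plain files only
        next unless -T $file;   # heuristic: skip files that look like binary
        print "looks like text: $file\n";
    }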

    Also, while walking a directory tree yourself is certainly a useful exercise, there are modules to help you; here's one example with Path::Class:

    use warnings;
    use strict;
    use Path::Class qw/file dir/;

    my $PATH = dir('.');
    $PATH->recurse( callback => sub {
        my $file = shift;
        return if $file->is_dir || -B $file;
        my $fh = $file->open('<:raw') or return;
        read $fh, my $content, 2000 or return;
        close $fh;
        print $content;
    } );

    Just a small note: in your current code you have a few variables that could have better scoping. $CONTENT could be declared right before its use in read; then you don't have to clear it every time. And the $SUB you declare at the beginning of explore is never actually used, because it is shadowed by the second $SUB in the while loop.
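
    For instance, the file-printing branch of your loop could look like this (just a sketch of the same logic with tighter scoping):

        if (-f $SUB) {
            open(my $FILE, '<:raw', $SUB) or next;
            read($FILE, my $CONTENT, 2000) or next;
            close($FILE);
            print $CONTENT;   # $CONTENT simply goes out of scope here
        }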

    Hope this helps,
    -- Hauke D

Re: print all files is soo slow! Why?
by afoken (Chancellor) on Jul 27, 2016 at 06:01 UTC

    Perhaps not related to the speed problem: Your code may open a lot of directory handles while recursing, blocking some resources. File and directory handles should generally be treated as a limited resource. Your code should close the directory handle BEFORE recursing into the filesystem, not after. You could read the entire directory contents into an array, close the handle, and then iterate over the array.
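
    A sketch of that approach, keeping the structure (and the 2000-byte read) of the original explore:

    use strict;
    use warnings;

    sub explore {
        my $path = shift;
        opendir(my $dh, $path) or return;
        my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;   # handle released BEFORE any recursion
        for my $name (@entries) {
            my $full = "$path/$name";
            if (-d $full) {
                explore($full);
            }
            elsif (-f $full) {
                open(my $fh, '<:raw', $full) or next;
                read($fh, my $content, 2000);
                close($fh);
                print $content if defined $content;
            }
        }
    }
    explore('.');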

    An even better way would be a queue (think of it as a to-do-list) instead of using recursion. The queue is a simple array that starts with the directory to be "explored". While the array is not empty, shift out the first element and use it as a directory name to open a directory handle. Read all directory elements, push subdirectories to the array, handle non-directories directly in the loop. Close the directory handle.
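
    The same traversal with a queue instead of recursion might look like this (again only a sketch, with the same per-file handling):

    use strict;
    use warnings;

    my @queue = ('.');   # the to-do list, seeded with the start directory
    while (@queue) {
        my $dir = shift @queue;
        opendir(my $dh, $dir) or next;
        my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
        closedir $dh;    # at most one directory handle open at a time
        for my $name (@entries) {
            my $full = "$dir/$name";
            if (-d $full) {
                push @queue, $full;   # explore later
            }
            elsif (-f $full) {
                open(my $fh, '<:raw', $full) or next;
                read($fh, my $content, 2000);
                close($fh);
                print $content if defined $content;
            }
        }
    }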

    Perhaps related to the speed problem: If the output is initially fast and slows down over time, you are leaking resources, forcing the system to start swapping. Your code uses close instead of closedir to close $DIR. close cannot close a directory handle. You would have noticed that if you had added proper error handling (... or die "Can't close: $!", or autodie):

    >perl -Mstrict -w -e 'opendir my $dir,"." or die "opendir: $!";close $dir or die "close: $!";'
    close: Bad file descriptor at -e line 1.
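
    For comparison, closedir is the call that matches opendir; the same one-liner with closedir exits silently:

    >perl -Mstrict -w -e 'opendir my $dir,"." or die "opendir: $!";closedir $dir or die "closedir: $!";'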

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: print all files is soo slow! Why? (autoflush, cmd.exe)
by Anonymous Monk on Jul 26, 2016 at 01:58 UTC

    Maybe it's because autoflush is off, so try

    use IO::Handle;
    STDOUT->autoflush(1);

    but then cmd.exe is also slow; it doesn't like it when you print lots and lots of stuff, so try redirecting the output to a file: perl yada.pl > yada.txt

      Thank you, but that didn't solve the problem.

      I had some binary files in the folder, which is what messed up my program, I guess. Somehow I overlooked them. I wasn't going to print binary files, only text files. And once I moved those binary files out into another folder, my program ran correctly.

      Instead of using print $CONTENT, I made a for loop that printed the characters one by one, skipping all the special characters such as tab, bell, newline, backspace, etc., and there was no more delay. So, that solved it.

      (As long as I only read the binary files, everything was okay. But when I tried printing them, there was a lot of delay.)
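
      A compact way to do that kind of filtering in one pass is tr///. This is only a sketch of the idea, not the exact loop I used; it keeps printable ASCII plus newlines and deletes everything else:

      (my $printable = $CONTENT) =~ tr/\x20-\x7E\n//cd; # drop all other bytes
      print $printable;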

Re: print all files is soo slow! Why? (stat, ntfs, links)
by tye (Sage) on Jul 27, 2016 at 12:10 UTC

    I suspect that the slowness is largely due to the fact that the Perl code can't resist doing a stat on each file found, and Perl's emulation of stat(2) on Windows does extra work to ask for the count of "links" that exist to that file. Unfortunately, NTFS supports hard links in a way such that the number of hard links is not efficiently cached as in a Unix inode, and so the code to look up the link count sometimes does things that can take significantly longer than only using FindNextFile would. See p5git://win32/win32.c:

    if (!w32_sloppystat) {
        /* We must open & close the file once; otherwise file attribute changes */
        /* might not yet have propagated to "other" hard links of the same file. */
        /* This also gives us an opportunity to determine the number of links. */
        HANDLE handle = CreateFileA(path, 0, 0, NULL, OPEN_EXISTING, 0, NULL);
        if (handle != INVALID_HANDLE_VALUE) {
            BY_HANDLE_FILE_INFORMATION bhi;
            if (GetFileInformationByHandle(handle, &bhi))
                nlink = bhi.nNumberOfLinks;
            CloseHandle(handle);
        }

    It is my experience that the time taken by that code can be fairly short but sometimes is pronounced (and seems to at least nearly lock up much of Windows and so feels like some kind of interlock that also involves networking calls). Though I have yet to find technical details about what is going on.

    It is too bad that one can't easily arrange for w32_sloppystat to be true for the many cases when one would like stat to be fast at the expense of things that very often won't matter much to Win32 uses of Perl code.

    #ifdef PERL_IS_MINIPERL
        w32_sloppystat = TRUE;
    #else
        w32_sloppystat = FALSE;
    #endif

    It would be quite nice if that unconditional FALSE were instead a lookup of some environment variable, like PERL_WIN32_SLOPPY_STAT. (Update: Or does ${^WIN32_SLOPPY_STAT} = 1; still work for that?)
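
    If that variable does still work as documented in perlvar, opting in would be a one-liner at the top of a script (a sketch; the file name is made up):

    ${^WIN32_SLOPPY_STAT} = 1;      # ask for the faster, less thorough stat
    my @info = stat 'example.txt';  # link count may now be reported as 1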

    Though, it is possible to get Perl to quickly iterate over file names in Win32 by avoiding readdir and instead calling FindFirstFile and FindNextFile more directly. There is even such code hidden deep in the archives of this very website. I'll probably eventually succeed in finding it at which point I'll post a pointer to such.

    Update: Re: Threads slurping a directory and processing before conclusion looks useful (or at least interesting). It hints that one can get sloppy stat via some special Perl variable. I have not yet looked into whether that is still true. Re: Quickest way to get a list of all folders in a directory says similar things and fills in one more detail. Re^3: Win32api::File and Directories offers some code that might be another good route.

    - tye