learningperl01 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl experts,
I am pretty new to Perl and I am having a memory problem with my code. I have narrowed it down to the following section, which always seems to generate an out-of-memory error when run against a large number of files.
opendir (TXT, "$processedDirPath") or die "Cannot open + $processedDirPath directory $!\n"; while ( (my $matchingFiles = readdir(TXT)) ) { if ( $matchingFiles =~ /\.rtsd001$/ ) { my @fileContents; my $UNMODIFIED; my $MODIFIED; open(UNMODIFIED, "<$processedDirPath/$matching +Files") or die "Cannot open $UNMODIFIED for reading $!\n"; open(MODIFIED, ">$processedDirPath/$matchingFi +les.old") or die "Cannot open $MODIFIED for writing $!\n"; while (<UNMODIFIED>) { /STRUC20/ and @fileContents=(), next or pu +sh @fileContents, $_; # All files that end in rtsd001 will need to be + modified. } print MODIFIED @fileContents; close(UNMODIFIED); close(MODIFIED); system( "/bin/mv", "$processedDirPath/$matchin +gFiles.old", "$processedDirPath/$matchingFiles" ) == 0 or warn "Move +command filed $!\n";; } } closedir TXT;

Now a little background: I want to process all files in the directory whose path is held in the $processedDirPath variable, but only work with files that end in "rtsd001".

I have also tried saving the readdir results to an array, grepping for the file extension I want, and looping over that instead.
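Roughly, that attempt looked like this (a sketch from memory rather than the exact code, so the names may differ slightly):

opendir my $dh, $processedDirPath
    or die "Cannot open $processedDirPath directory $!\n";
# Grab only the *.rtsd001 entries, then loop over the resulting array.
my @matchingFiles = grep { /\.rtsd001$/ } readdir $dh;
closedir $dh;

for my $file (@matchingFiles) {
    # ... same per-file processing as in the code above ...
}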

What is the most memory-efficient/best way to read the files in a directory and then modify each file found? (The script has to rewrite every file to remove all lines before the STRUC20 line.)

The directory I am running readdir against contains about 6-10 thousand files, each roughly 1k to 40k in size.

So, working with this large number of files, what would be the best way to do this?
Thanks for the help in advance.

Re: Readdir against large number of files
by moritz (Cardinal) on Oct 28, 2009 at 17:18 UTC

    I'd try to use glob in scalar context:

    while (my $matchingFiles = glob("*.rtsd001")) { ... }

    Though that doesn't seem to be your real problem - in your current version you already hold only one filename in memory at a time.

    I'd rather think your problem is @fileContents, which can hold the contents of an entire file. Instead of reading a file, selecting lines, storing those, and printing the result, you can just print them line by line:

    while (<UNMODIFIED>) { /STRUC20/ and print MODIFIED $_; }
    Perl 6 - links to (nearly) everything that is Perl 6.
      Thanks for the help. I gave your code a try, but it is only printing the STRUC20 line. What I was looking to do is print all lines in the file after STRUC20. Also, every file contains the line STRUC20 exactly once.
      Thanks again for all the help.

      Example file contents of file1.rtsd001
      date time
      blah blah
      test test
      STRUC20    #Code should look for this line and print everything after it.
      need lines
      need lines
      print these lines
        What I was looking to do is print all lines in the file after STRUC20

        The flip-flop operator (aka the range operator .. in scalar context) might come in handy for that:

        while (<UNMODIFIED>) {
            print MODIFIED $_ unless 1 .. /STRUC20/;
            # i.e. do print, except for the first line up to a line that matches STRUC20
        }
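        Putting that together with the scalar-context glob mentioned earlier, a complete (untested) sketch might look like this; the lexical filehandles and the ".new" temp suffix are just illustrative choices:

        while (my $file = glob("$processedDirPath/*.rtsd001")) {
            open my $in,  '<', $file       or die "Cannot open $file for reading: $!\n";
            open my $out, '>', "$file.new" or die "Cannot open $file.new for writing: $!\n";
            while (<$in>) {
                # Skip everything from line 1 up to and including the STRUC20 line.
                print {$out} $_ unless 1 .. /STRUC20/;
            }
            close $in;
            close $out;
            rename "$file.new", $file or warn "rename failed for $file: $!\n";
        }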

        You're right, I didn't read your original code carefully enough.

        You can introduce a variable that stores if it's after the STRUC20 line:

        while (my $matchingFiles = glob("*.rtsd001")) {
            ...
            my $seen_struct;
            while (<UNMODIFIED>) {
                if ($seen_struct || /STRUC20/) {
                    $seen_struct = 1;
                    print MODIFIED $_;
                }
            }
            ...
        }

        That way you don't have to store all the rest of the lines in memory, and still get the same semantics.

        Update: actually this solution looks more elegant, but does nearly the same thing.

        Perl 6 - links to (nearly) everything that is Perl 6.
Re: Readdir against large number of files
by JavaFan (Canon) on Oct 28, 2009 at 17:17 UTC
    Even if you know the file name, on (most) file systems you'd have to scan the directory entries anyway, as file names are not stored in a way that allows a quick lookup by name. So I doubt there's much to gain in how you process the directory content.

    But you might gain a bit if the files are big. In the worst case, you store the entire content in memory. You could also write the inner loop as:

    use Fcntl 'SEEK_SET';
    while (<UNMODIFIED>) {
        if (/STRUC20/) {
            seek MODIFIED, 0, SEEK_SET or die;
            next;
        }
        print MODIFIED $_;
    }
    truncate MODIFIED, tell MODIFIED or die;
    Alternatively, find the start of the line following the last line mentioning STRUC20, then use sysread and syswrite to copy the last part of the file.
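    A rough, untested sketch of that approach (the two-pass structure, the temp-file name and the 64K buffer size are my own choices, not part of the suggestion above):

    use Fcntl qw(SEEK_SET);

    my $src = "$processedDirPath/$matchingFiles";

    # Pass 1: find the byte offset just past the last line mentioning STRUC20.
    my $offset = 0;
    open my $in, '<', $src or die "Cannot open $src: $!";
    while (<$in>) {
        $offset = tell $in if /STRUC20/;
    }
    close $in;

    # Pass 2: copy everything from that offset into a temp file, then rename it over the original.
    open my $raw, '<', $src       or die "Cannot reopen $src: $!";
    open my $out, '>', "$src.new" or die "Cannot open $src.new: $!";
    sysseek $raw, $offset, SEEK_SET or die "sysseek failed: $!";
    while (my $read = sysread $raw, my $buf, 64 * 1024) {
        syswrite $out, $buf, $read or die "syswrite failed: $!";
    }
    close $raw;
    close $out;
    rename "$src.new", $src or warn "rename of $src.new failed: $!";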

    And there's no need to use /bin/mv to move the temp. file. rename will do.
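    For example (untested, using the variables from the original script), in place of the system("/bin/mv", ...) call:

    # Perl's built-in rename avoids spawning /bin/mv once per file.
    rename "$processedDirPath/$matchingFiles.old", "$processedDirPath/$matchingFiles"
        or warn "Move failed: $!\n";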

    Of course, the MOST efficient way doesn't involve Perl at all.

Re: Readdir against large number of files
by jakobi (Pilgrim) on Oct 28, 2009 at 17:26 UTC
    I also fail to see obvious memory leak issues, so maybe it's time for some paranoia:
    1. use strict; use warnings;
    2. protect against pathological filenames and use the 3-argument form of
      open(UNMODIFIED,"<","$processedDirPath/$matchingFiles")
      (good style, but unlikely to be the problem)
    3. my bet: add a size check like
      -f "$processedDirPath/$matchingFiles" and -s _ < 40960 and do{warn "size or type problem with $processedDirPath/$matchingFile +s\n"; next};
    4. idle curiosity: can STRUC20 never occur again, on its own or as part of something else, after the first occurrence?
    cu & HTH, Peter -- hints may be untested unless stated otherwise; use with caution & understanding.

    Update:

    • what do you mean by "print DATA"? These changes don't touch your copying loop at all.
    • anyway, the size testing goes before your open() calls, that is: in the outer loop.
    • btw, consider rename instead of mv.
      Thanks for the help. I gave your code a try, but it is only printing the line DATA. What I was looking to do is print all lines in the file after STRUC20. Also, every file contains the line STRUC20 exactly once.
      Thanks again for all the help.

      Example file contents of file1.rtsd001
      date time
      blah blah
      test test
      STRUC20    #Code should look for this line and print everything after it.
      need lines
      need lines
      print these lines
Re: Readdir against large number of files
by gmargo (Hermit) on Oct 28, 2009 at 18:16 UTC

    Where's your state variable?

    my $printing = 0;
    while (<UNMODIFIED>) {
        if ($printing) { print MODIFIED $_; next; }
        $printing++ if /STRUC20/;
    }

    Also, you don't need the /bin/mv, just use unlink and rename.

      Why the next?

        Because additional regular expression comparisons are pointless.