illitrit has asked for the wisdom of the Perl Monks concerning the following question:

Today I was asked to change the expression "origtext" to "changedtext" in every file within a large directory structure. Not knowing a better tool, I decided to use Perl and File::Find. My question is about File::Find in general, so I'll post the bare skeleton I wrote to just list the files that are about to be changed (i.e. a grep-type script).

What I encountered was that my find.pl script (listed below) ran out of memory when I ran it on the base directory (containing 395 subdirectories). I ended up writing a small wrapper script to run find.pl on one subdirectory at a time.

Was the out-of-memory problem due to a failure in my use of Perl, did I miss something in the File::Find pod, or is it due to the way File::Find handles recursion?

This is the script that lists the files I'm going to change; it's really just a test that I had the regex and other bits right before actually changing any files.
find.pl
#!/usr/bin/perl -w
use strict;
use File::Find;

my ($find, @directories) = (@ARGV);

unless (scalar @directories) {
    print "Usage: find.pl FINDSTRING DIR1...\n";
    exit;
}

find(\&do_this, @directories);
exit;

sub do_this {
    if (! -f $_) {      # if it's not a regular file skip it
        return;
    }
    if (open(F, $_)) {
        undef $/;       # slurp mode: the whole file is read into one scalar
        my $file = <F>;
        if ((defined $file) && ($file ne '') && ($file =~ /\Q$find\E/giso)) {
            # the "defined" test was added later, after I realized that an
            # empty file would leave $file undef!
            print "$File::Find::name\n";
        }
        close(F);
    } else {
        warn "Unable to open '$_': $!";
    }
}


Just a note: in the end the job was accomplished. I'm asking now for learning purposes, so that next time I can do it better or more correctly.

Thanks for any advice you can give, including general critique,
James

Re: Is this normal for File::Find or where did I go wrong.
by mr.nick (Chaplain) on May 03, 2001 at 06:40 UTC
    Your solution is nice and Perlish, but (if you care), I would have implemented it like this:
    find /the/start/dir -type f -exec perl -i~ -ne \
        's/origtext/changedtext/g; print' {} \;
    If you wanted to limit it to html files only, then a
    find /the/start/dir -type f -name '*.html' ...
    would have worked. In fact, using sed or tr instead of Perl would work, 'cept I don't know how to do inplace editing with them.

    Of course, this doesn't have the reusability factor of your solution; but in this case I would have sacrificed it for quickness.

    A big part of administration is knowing the right tool for the job.

Re: Is this normal for File::Find or where did I go wrong.
by converter (Priest) on May 03, 2001 at 06:09 UTC
    You're reading the entire file and throwing it into a scalar. What if you read a 400 MB file? Got memory? Even if you have the memory, do you really want to waste time churning through non-text files if you don't need to?

    Your best bet is to process the file one record at a time.
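
    For example, the body of do_this could be rewritten along these lines (a sketch only; it reuses $find, do_this and the rest of your skeleton unchanged, and leaves $/ at its default so <F> returns one line at a time):

    sub do_this {
        return unless -f $_;                  # skip anything that isn't a regular file
        unless (open(F, $_)) {
            warn "Unable to open '$_': $!";
            return;
        }
        while (my $line = <F>) {              # one record (line) at a time,
            if ($line =~ /\Q$find\E/i) {      # so only one line is ever in memory
                print "$File::Find::name\n";
                last;                         # the first match is enough for a listing
            }
        }
        close(F);
    }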

    This still leaves a potential problem: what if you process a 400 MB file with no newlines? Perl will treat the entire 400 MB file as one (big) record. This is probably not what you want to happen.

    The solution to this would probably be to read an arbitrary chunk of data, say 256 bytes, from the top of the file and check for a newline and any characters not included in [ -~\r\n\t\f]. If you don't find a newline in the first 256 characters (or whatever limit makes the most sense to you), or if you find characters outside the aforementioned class, chances are you're looking at a file that does not contain text and you should probably print a warning, close the file and move to the next file.
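
    A rough sketch of that kind of check (the 256-byte limit and the character class are just the ones suggested above, and the helper name is made up):

    sub looks_like_text {
        my $fh = shift;
        my $chunk;
        read($fh, $chunk, 256) or return 0;        # empty file or read error: treat as non-text
        return 0 unless $chunk =~ /\n/;            # no newline in the first 256 bytes
        return 0 if $chunk =~ /[^ -~\r\n\t\f]/;    # bytes outside the printable-text class
        seek($fh, 0, 0);                           # rewind so the caller can re-read the file
        return 1;
    }

    In do_this you could then warn and return unless looks_like_text(\*F) says yes, before scanning the file.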

    I recently wrote a DOS to UNIX newline conversion script that tries to address these issues. Rather than waste bandwidth, you can see the code at http://dalnet-perl.org/crlflf.txt . This is actually a port of someone else's bash script. The original failed to use sanity checks and ran quite slowly as a result.

      Thanks for your comments,

      Something I failed to mention in my original query was that I knew certain things about the files I'd be looking at:

      A) They are all files for a webserver; each subdirectory of the root directory is a separate domain.

      B) I could, and probably should, have done a quick test on the file name to make sure I only changed .html files (see the one-line sketch after this list); however, that is hindsight.

      C) Because of A), I knew none of the files would be bigger than a few hundred kilobytes at most, and the server has plenty more than that in physical RAM.
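
      For what it's worth, the test mentioned in B) could have been a single line at the top of do_this, something like this (assuming every page ends in .htm or .html):

      return unless /\.html?$/i;   # only look at .htm/.html files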

      Thanks again,
      James
Re: Is this normal for File::Find or where did I go wrong.
by Sinister (Friar) on May 03, 2001 at 16:44 UTC
    What I usually do when using File::Find is the following:
    use File::Find;

    $RootDir = "/www/htdocs";

    find(\&wanted, $RootDir);

    foreach (@list) {
        # ...do some...
    }

    sub wanted {
        my $arg = $_;
        if ( $arg =~ /\.html/ ) {
            push @list, $arg;
        }
        $_ = $arg;
        return;
    }


    To conclude: I put $_ into $arg and put it back at the end. I don't know exactly why it works, but compared to not doing that, it was better. {looks puzzled}

    Sinister greetings.