in reply to Is this normal for File::Find or where did I go wrong.

You're reading the entire file and throwing it into a scalar. What if you read a 400 MB file? Got memory? Even if you have the memory, do you really want to waste time churning through non-text files if you don't need to?

Your best bet is to process the file one record at a time.
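A minimal sketch of what that looks like (the sub name and callback style are my own, not your code): read with the diamond operator inside a while loop, so memory use is bounded by the longest record rather than the whole file.

```perl
use strict;
use warnings;

# Process a file one record (line) at a time instead of slurping
# it into a scalar. Only the current line is held in memory.
sub process_file {
    my ($path, $callback) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    while (my $line = <$fh>) {
        $callback->($line);    # handle one record at a time
    }
    close $fh;
}
```

You'd call it with whatever per-line work you need, e.g. `process_file('index.html', sub { print $_[0] });`.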

This still leaves a potential problem: what if you process a 400 MB file with no newlines? Perl will treat the entire 400 MB file as one (big) record. This is probably not what you want to happen.

The solution would probably be to read an arbitrary chunk of data, say 256 bytes, from the top of the file and check it for a newline and for any characters outside the class [ -~\r\n\t\f]. If you don't find a newline within the first 256 characters (or whatever limit makes the most sense to you), or if you find characters outside that class, chances are you're looking at a file that does not contain text. In that case, print a warning, close the file, and move on to the next one.
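That check can be sketched like this (the sub name and the exact treatment of files shorter than 256 bytes are my own choices):

```perl
use strict;
use warnings;

# Heuristic text check: read the first 256 bytes, then reject the
# file if a full 256-byte chunk contains no newline, or if any byte
# falls outside the printable-text class [ -~\r\n\t\f].
sub looks_like_text {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $read = read $fh, my $chunk, 256;
    close $fh;
    return 0 unless defined $read;
    return 1 if $read == 0;                       # empty file: nothing to mangle
    return 0 if $read == 256 && $chunk !~ /\n/;   # long run with no newline: suspicious
    return 0 if $chunk =~ /[^ -~\r\n\t\f]/;       # byte outside the text class
    return 1;
}
```

A short file with no trailing newline still passes here, since the whole file fits in one small record anyway; tighten that if your data warrants it.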

I recently wrote a DOS to UNIX newline conversion script that tries to address these issues. Rather than waste bandwidth, you can see the code at http://dalnet-perl.org/crlflf.txt . This is actually a port of someone else's bash script. The original failed to use sanity checks and ran quite slowly as a result.
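The linked script isn't reproduced here, but for readers who just want the core transformation, the DOS-to-UNIX step itself is a one-line substitution (this sketch is mine, not the crlflf.txt code):

```perl
use strict;
use warnings;

# Convert DOS (CRLF) line endings to UNIX (LF). Text that is
# already UNIX-style passes through unchanged.
sub dos2unix {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}
```

The sanity checks discussed above are what keep a script like this from chewing through binaries or pathological files.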


Replies are listed 'Best First'.
Re: Re: Is this normal for File::Find or where did I go wrong.
by illitrit (Friar) on May 03, 2001 at 06:17 UTC
    Thanks for your comments,

    Something I failed to place in my original query was that I was aware of certain things about the files I'd be looking at:

    A) They are all files for a webserver; each subdirectory of the root directory is a separate domain.

    B) I could, and probably should, have done a quick test on the file name to make sure I only changed .html files; however, this is hindsight.

    C) Due to A) I knew none of the files would be bigger than a few hundred kilobytes at most and the server has plenty more than that in physical RAM.

    Thanks again,
    James