You're reading the entire file and throwing it into a scalar. What if you read a 400 MB file? Got memory? Even if you have the memory, do you really want to waste time churning through non-text files if you don't need to?

Your best bet is to process the file one record at a time.
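A minimal sketch of record-at-a-time processing, assuming the default record separator (newline). The helper name `count_matching_lines` and its arguments are made up for illustration; the point is only that the `while` loop holds one line in memory at a time instead of the whole file.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines in $path that match $pattern,
# reading one record (line) at a time rather than slurping the file.
sub count_matching_lines {
    my ($path, $pattern) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";

    my $count = 0;
    while (my $line = <$fh>) {    # one record per iteration
        $count++ if $line =~ $pattern;
    }

    close $fh;
    return $count;
}
```

Something like `count_matching_lines($file, qr/foo/)` would then work on a file of any size, as long as the individual lines are reasonably short.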

This still leaves a potential problem: what if you process a 400 MB file with no newlines? Perl will treat the entire 400 MB file as one (big) record. This is probably not what you want to happen.

One way to handle this is to read an arbitrary chunk of data, say 256 bytes, from the top of the file and check it for a newline and for any characters not included in [ -~\r\n\t\f]. If you don't find a newline in the first 256 bytes (or whatever limit makes the most sense to you), or if you find characters outside that class, chances are you're looking at a file that does not contain text. In that case you should probably print a warning, close the file and move on to the next file.
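The check described above could be sketched roughly like this. The name `looks_like_text` and the default limit are assumptions for illustration, and this is not the code from the script mentioned below. It reads the sample in binary mode and rejects the file if the sample has no newline or contains anything outside [ -~\r\n\t\f].

```perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the first $limit bytes of $path
# contain a newline and no bytes outside [ -~\r\n\t\f], else 0.
sub looks_like_text {
    my ($path, $limit) = @_;
    $limit = 256 unless defined $limit;    # arbitrary default sample size

    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;                           # look at raw bytes
    my $buf = '';
    my $got = read($fh, $buf, $limit);
    close $fh;

    die "Cannot read $path: $!" unless defined $got;
    return 0 if $got == 0;                 # empty file, nothing to process
    return 0 if $buf !~ /\n/;              # no newline in the sample
    return 0 if $buf =~ /[^ -~\r\n\t\f]/;  # byte outside the allowed class
    return 1;
}

# Usage: warn about and skip anything that does not look like text.
for my $file (@ARGV) {
    unless (looks_like_text($file)) {
        warn "Skipping $file: does not look like a text file\n";
        next;
    }
    # ... process $file one record at a time here ...
}
```

Note that this heuristic also rejects a short text file with no trailing newline, and any text containing bytes above 0x7E (Latin-1 or UTF-8 text, for instance), so the character class may need widening for your data.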

I recently wrote a DOS to UNIX newline conversion script that tries to address these issues. Rather than waste bandwidth posting it here, I'll point you to the code at http://dalnet-perl.org/crlflf.txt . It is actually a port of someone else's bash script. The original had no sanity checks and ran quite slowly as a result.


In reply to Re: Is this normal for File::Find or where did I go wrong. by converter
in thread Is this normal for File::Find or where did I go wrong. by illitrit
