grepdashv has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that has worked fine against many small files, but when I tried to run it against a large file (~850MB), it simply hung. I tried modifying the script in order to troubleshoot it, and I found that it was hanging at the very first readline. In addition to being a very large file, it may have another problematic characteristic -- the first "line" may be composed of an extremely large number of ASCII null characters all strung together. Could the length of the file, the length of that line, or the weird characters cause read and readline to hang?

Re: hang caused by read / readline?
by ikegami (Patriarch) on Sep 04, 2009 at 02:12 UTC

    Why are you using readline to read a file that doesn't contain lines?

    Well actually, readline can be configured to effectively become a read by setting $/ to a reference to a number. For example,

    local $/ = \4096;
    while (<$fh>) {
        # $_ contains one 4k chunk
        ...
    }
      The file does contain lines. However, there is a chunk of chr(0) that has been prepended (by a process outside my control) to the first line, and I didn't realize that there was an INSANE quantity of chr(0) until I was able to demonstrate it just a bit ago. So, this has exposed the next layer of the problem. I now know why readline() was choking (and why I thought that read() was choking even though it really wasn't, though that's not relevant). So, what's the best way to quickly breeze past the chr(0) mess -- now at 385MB and counting -- to find the start of the real data?
        Use block mode to filter out the NULs first.
        perl -pe'BEGIN { $/ = \(64*1024); } s/\0+//g' infile | line_reading_script.pl
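        If you'd rather skip past the NULs inside the script itself (instead of piping through a filter), one rough sketch -- the filename, chunk size, and variable names here are only placeholders -- is to scan fixed-size blocks until the first non-NUL byte shows up, then seek to that offset and read lines normally:

        use strict;
        use warnings;

        my $inputfile = 'bigfile.dat';    # placeholder name

        open my $fh, '<:raw', $inputfile
            or die "Unable to open \"$inputfile\": $!\n";

        # Scan 64K blocks until the first non-NUL byte appears.
        my $offset = 0;
        my $chunk;
        while (read($fh, $chunk, 64 * 1024)) {
            if ($chunk =~ /[^\0]/) {
                $offset += $-[0];    # absolute offset of the first real byte
                last;
            }
            $offset += length $chunk;
        }

        # Jump to the start of the real data and read lines from there.
        seek($fh, $offset, 0) or die "seek failed: $!\n";
        while (my $line = <$fh>) {
            # ... process $line as usual ...
        }

        Scanning in 64K blocks rather than byte by byte means even a few hundred MB of padding is skipped in seconds.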
Re: hang caused by read / readline?
by Anonymous Monk on Sep 04, 2009 at 02:13 UTC
    You are probably trying to read the whole file into memory. It's best if you show a small, self-contained program that replicates the problem.
      Sorry, I forgot to include a snippet because my wife was rushing me out the door. ;-) No, I'm not reading the whole file into an array; I'm using scalar context to read in one line at a time:
      ...
      open (INPUTFILE, $inputfile)
          or die ("\nERROR: Unable to open file \"$inputfile\".\n");
      while (!eof(INPUTFILE)) {
          $line = readline(INPUTFILE);
      ...
      And that's where it chokes. If I insert a print statement before and after the readline (as checkpoints), the first one will work, and the second one will not. I don't have experience using read, and when I tried replacing the readline with a simple read earlier, it looked like the same hang was happening. However, when I tried it again just now, it worked fine:
      read(INPUTFILE, $x, 1);
      Since I knew (from MUCH smaller files) that the input files would have chunks of chr(0) at the front, I wrote a bit of code around the read statement above to see just how bad the situation is for this particularly large file. As I write this message, we're at 120MB worth of continuous chr(0) and counting...
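      For reference, a minimal sketch of that counting loop (reconstructed here -- the progress interval and variable names are just guesses) could look like:

      my $count = 0;
      my $x;
      while (read(INPUTFILE, $x, 1)) {
          last if $x ne "\0";    # stop at the first real byte
          $count++;
          # Checkpoint every 10MB of NULs so I can watch the progress.
          print "$count NUL bytes so far...\n"
              if $count % (10 * 1024 * 1024) == 0;
      }
      print "Total leading NUL bytes: $count\n";

      Reading one byte at a time is painfully slow over hundreds of MB, which is why the block-mode approach suggested above gets through the padding so much faster.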