jmaya has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,
I have a 300Gig text file that I need to run through. I tried (in short)
--------------------------------------
open OUT, "file.txt" or die "$!\n";
while (<OUT>) {
    # do stuff here
}
--------------------------------------

It was my belief that when you read with while like this, the whole file is not loaded into memory. Instead, the result is that it HANGS.

Can someone explain?

Thanks
John

Code tags and writeup formatting touched up by davido.

Replies are listed 'Best First'.
Re: Iterating through HUGE FILES
by Joost (Canon) on May 10, 2005 at 19:34 UTC
    The part of the code you show here is good.

    Is your perl compiled with large (that is >2Gb) file support?

    You can tell by doing

    > perl -V:uselargefiles
    uselargefiles='define';
    If the output isn't uselargefiles='define'; then you need to get or compile another perl binary. uselargefiles is an option you need to set when compiling the perl interpreter, though in recent perls (I believe since 5.8.0) the default is to turn it on.

      So that WAS the problem. I knew there was some compiled-in limit. Thank you for mentioning this.

      If this is the problem (which seems likely) with the original poster's code, again I suggest using other utilities to break up the data set into manageable chunks, and then processing those chunks in perl.
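      A minimal sketch of that chunked approach, assuming the big file has already been split with a system utility (the split command and the chunk_* names below are purely illustrative):

      # Hypothetical pre-step, run once outside perl:
      #   split -l 10000000 bigfile.txt chunk_
      use strict;
      use warnings;

      my @chunks = glob 'chunk_*';
      for my $chunk ( sort @chunks ) {
          open my $fh, '<', $chunk or die "Can't open $chunk: $!";
          while ( my $line = <$fh> ) {
              # do stuff here, one manageable chunk at a time
          }
          close $fh;
      }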

        Other utilities, like say: cat HUGE | perl my_script.pl

        ...This is, of course, bait for Merlyn to jump all over :-)

        Seriously though, can you simply read from STDIN ? Then your Perl shouldn't care how big the file is.
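        A minimal sketch of the read-from-STDIN approach (my_script.pl is just the placeholder name from the example above):

        # Invoked as:   cat HUGE | perl my_script.pl
        # (or simply:   perl my_script.pl < HUGE, which skips the extra cat process)
        use strict;
        use warnings;

        while ( my $line = <STDIN> ) {
            # do stuff here; only one line is held in memory at a time
        }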

      It is ActiveState's perl; I think it was not compiled with that parameter. Thank you.

        Which version of AS Perl? It must be pretty ancient, as the last 7 or 8 versions (at least) have been built with large file support. On Win32, anyway; it's easy to forget that they also produce binaries for other OSs.

        If you cannot upgrade for any reason, then I second the idea of using a system utility to read the file and pipe it into your script. I'd probably do it using the 'piped open'. If you need to re-write the data, send it to stdout and redirect the output via the command line.

        die "You didn't redirect the output" if -t STDOUT; open BIGFILE, "cmd/c type \path\to\bigfile |" or die $!; while( <BIGFILE> ) { ## do stuff } close BIGFILE; __END__ script bigfile.dat > modified.dat

        Dying if STDOUT hasn't been re-directed is a touch that you'll appreciate after the first time you print a huge binary file to the console by accident. The bells! The bells! :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Iterating through HUGE FILES
by Animator (Hermit) on May 10, 2005 at 17:14 UTC

    Well, are there newlines in the file? (Or to be precise: does the value of the input record separator appear somewhere in the file?)

    If there aren't, then that's your problem. What you can do then is either set the input record separator ($/) to another character, or set it to a reference to an integer, in which case that many bytes will be read at a time. (For example, after $/ = \123; each <OUT> will read 123 bytes.)
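    A minimal sketch of that fixed-size-record approach (the 1 MB chunk size and the file name are just examples):

    use strict;
    use warnings;

    my $chunk_size = 1024 * 1024;    # 1 MB
    $/ = \$chunk_size;               # a reference to an integer makes <> read fixed-size blocks

    open my $in, '<', 'file.txt' or die "$!\n";
    while ( my $chunk = <$in> ) {
        # do stuff here; $chunk holds up to 1 MB of raw data
    }
    close $in;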

      To add to/support what Animator said, I use the $/=\123 trick regularly at work. The implementation reads in 2M worth of data, processes it, seeks back 1k, and reads in another 2M chunk. The code seeks back 1k because the processing involves regular expressions and we want to catch matches that straddle the 2M boundary. If you have variable-length records, this may not work so well, as it is very likely the end of the read-in buffer will fall in the middle of a record.
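      A rough sketch of that overlapping-read pattern (the 2 MB and 1 KB sizes come from the description above; the file name is a placeholder):

      use strict;
      use warnings;
      use Fcntl qw(SEEK_CUR);

      my $chunk_size = 2 * 1024 * 1024;    # read 2 MB at a time
      my $overlap    = 1024;               # seek back 1 KB so matches can straddle chunk boundaries

      open my $fh, '<', 'bigfile.dat' or die "$!\n";
      while ( read( $fh, my $buffer, $chunk_size ) ) {
          # run the regular expressions against $buffer here
          # (note: a match that falls entirely inside the overlap region may be seen twice)

          last if eof($fh);
          seek( $fh, -$overlap, SEEK_CUR ) or die "seek failed: $!";
      }
      close $fh;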

Re: Iterating through HUGE FILES
by dave_the_m (Monsignor) on May 10, 2005 at 16:39 UTC
    It was my belief that when you do the while bit here, it does not open the whole file. The result is it HANGS
    The code above looks okay. Processing 300Gb of anything is going to take quite some time. Are you sure it's hanging rather than just taking a long time?

    Dave.

Re: Iterating through HUGE FILES
by dynamo (Chaplain) on May 10, 2005 at 16:57 UTC
    I had a very similar problem working with a HUGE text file myself once. It was about 50 gigs and I ran into a limit that felt like something compiled into perl or my system libraries - I tried all sorts of work-arounds to try to keep the processing incremental.

    In the end, I found my solution in using an external shell program to send the lines to perl one at a time. It was slower than it probably would have been running the loop in perl, but it worked. Also check whether you get any relief from your problem if you pipe the input in through cat or similar; piping your input makes it impossible to seek within the file, and I believe that perl treats it differently.

    Sorry for all the hand-waving, but when you are having bugs that shouldn't occur, you have to be willing to try solutions that shouldn't work.

Re: Iterating through HUGE FILES
by ikegami (Patriarch) on May 10, 2005 at 16:44 UTC

    What makes you think it's hanging and not taking a long time? What makes you think the problem isn't with "do stuff here"?

    By the way, I'm curious as to why you named the input file handle "OUT".

Re: Iterating through HUGE FILES
by gellyfish (Monsignor) on May 10, 2005 at 16:47 UTC

    Add a print '.'; as the first thing in the while to see the progress. Of course, if it is a small number of very large lines rather than a very large number of small lines, then you might not see much happening.

    /J\

      Make sure you set $| = 1 beforehand, though (or print to STDERR instead).

      Also useful is something like print STDERR '.' if $. % 100 == 0 to get biff'd every 100 lines.
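      Putting those two suggestions together, a small sketch (the file name, handle name, and the 100-line interval are just the examples from this thread):

      use strict;
      use warnings;

      $| = 1;    # unbuffer STDOUT so the progress dots appear immediately
                 # (printing the dots to STDERR instead would avoid the need for $|)

      open OUT, "file.txt" or die "$!\n";
      while (<OUT>) {
          print '.' if $. % 100 == 0;    # one dot per 100 lines read ($. is the input line counter)
          # do stuff here
      }
      close OUT;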

Re: Iterating through HUGE FILES
by sh1tn (Priest) on May 10, 2005 at 17:42 UTC
    You can check that the reading process has not stopped
    by printing every (for example) 1000th line:
    while ( <OUT> ) {
        $. % 1000 or print "line $.\n";
        # do stuff here
    }


Re: Iterating through HUGE FILES
by Adrade (Pilgrim) on May 11, 2005 at 00:54 UTC
    Dear John,

    You may want to consider using the sysread() call. Although I can't say for sure, this might solve your problem...
    sysopen(FILE, $filename, 0);
    while (sysread(FILE, $buffer, 10240)) {
        print $buffer;
    }
    close(FILE);
    I hope it helps!
      -Adam
Re: Iterating through HUGE FILES
by sk (Curate) on May 11, 2005 at 04:23 UTC
    I created a dummy ~5GB file.
    perl -le 'BEGIN{$,=","} print map int rand 1000, 1..2500 for 1..547_183' > infile.csv
    [sk]% time wc -l infile.csv
    547183 numbers.csv
    2.730u 11.660s 1:32.06 15.6%
    [sk]% time perl -nle '$line++; print +($line-1) if eof;' infile.csv
    547183
    19.600u 4.560s 0:24.16 100.0%

    Agreed 300GB is freaking large! But Perl was able to read this 5GB file very fast. I don't see a huge problem just reading a 300GB file.

    It will be hard for us to identify where the program is stalling without looking at the "do stuff here" block. For example, if the file you are reading in is a CSV file and you parse it into a HUGE list, that will slow down your process. Thinking ahead and designing the right input file for processing will solve runtime issues. For example, if you need only certain portions of each line, you might want to trim down the input file separately before you start your "core" process, as in the sketch below.
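    A small pre-pass along those lines might look like this (the column numbers and file names are purely hypothetical):

    use strict;
    use warnings;

    # Hypothetical pre-pass: keep only columns 0 and 3 of a comma-separated file,
    # writing a much smaller file for the "core" process to read.
    open my $in,  '<', 'infile.csv'  or die "$!\n";
    open my $out, '>', 'trimmed.csv' or die "$!\n";
    while ( my $line = <$in> ) {
        chomp $line;
        my @fields = split /,/, $line;
        print {$out} join( ',', @fields[ 0, 3 ] ), "\n";
    }
    close $in;
    close $out;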

    Also have you tried running this script on a smaller file?

    % head -10000 inputfile > smallfile
    % script smallfile

    See if this completes. If it does, then there is some issue with the large file.

    Are there lines inside your while block that do not have to be processed for every record?

    cheers

    SK

    PS: Just curious, what kind of application requires a 300GB file? How do you manage such large files? The very thought of backing it up scares me :)

Re: Iterating through HUGE FILES
by smullis (Pilgrim) on May 12, 2005 at 10:57 UTC

    Hello there,


    I'm surprised no one else has mentioned this, but you should probably check out Tie::File. (Maybe it has been mentioned and I didn't notice... but anyway.)


    ...lifted straight from the module docs...

    # This file documents Tie::File version 0.96
    use Tie::File;

    tie @array, 'Tie::File', filename or die ...;

    $array[13] = 'blah';      # line 13 of the file is now 'blah'
    print $array[42];         # display line 42 of the file

    $n_recs = @array;         # how many records are in the file?
    $#array -= 2;             # chop two records off the end

    for (@array) {
        s/PERL/Perl/g;        # Replace PERL with Perl everywhere in the file
    }

    # These are just like regular push, pop, unshift, shift, and splice
    # Except that they modify the file in the way you would expect

    push @array, new recs...;
    my $r1 = pop @array;
    unshift @array, new recs...;
    my $r2 = shift @array;
    @old_recs = splice @array, 3, 7, new recs...;

    untie @array;             # all finished

    Cheers
    SM