jmaya has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,
I have a 300Gig text file that I need to run through. I tried (in short)
--------------------------------------
open OUT, "file.txt" or die "$!\n";
while (<OUT>) {
    # do stuff here
}
--------------------------------------

It was my belief that when you read with while like this, the whole file is not loaded into memory. Instead, the result is that it HANGS.

Can someone explain?

Thanks
John

Code tags and writeup formatting touched up by davido.

Replies are listed 'Best First'.
Re: Iterating through HUGE FILES
by Joost (Canon) on May 10, 2005 at 19:34 UTC
    The part of the code you show here is good.

    Is your perl compiled with large (that is >2Gb) file support?

    You can tell by doing

    > perl -V:uselargefiles
    uselargefiles='define';
    If the output isn't uselargefiles='define'; then you need to get or compile another perl binary. uselargefiles is an option you need to set when compiling the perl interpreter, though in recent perls (I believe since 5.8.0) the default is to turn it on.

      So that WAS the problem. I knew there was some compiled-in limit. Thank you for mentioning this.

      If this is the problem (which seems likely) with the original poster's code, again I suggest using other utilities to break up the data set into manageable chunks, and then processing those chunks in perl.
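      A minimal sketch of that chunked approach, assuming the big file has already been split with a system utility (the split command and the chunk_* names below are purely illustrative):

      # Hypothetical pre-step, run once outside perl:
      #   split -l 10000000 bigfile.txt chunk_
      use strict;
      use warnings;

      my @chunks = glob 'chunk_*';
      for my $chunk ( sort @chunks ) {
          open my $fh, '<', $chunk or die "Can't open $chunk: $!";
          while ( my $line = <$fh> ) {
              # do stuff here, one manageable chunk at a time
          }
          close $fh;
      }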

        Other utilities, like say: cat HUGE | perl my_script.pl

        ...This is, of course, bait for Merlyn to jump all over :-)

        Seriously though, can you simply read from STDIN ? Then your Perl shouldn't care how big the file is.
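        A minimal sketch of the read-from-STDIN approach (my_script.pl is just the placeholder name from the example above):

        # Invoked as:   cat HUGE | perl my_script.pl
        # (or simply:   perl my_script.pl < HUGE, which skips the extra cat process)
        use strict;
        use warnings;

        while ( my $line = <STDIN> ) {
            # do stuff here; only one line is held in memory at a time
        }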

      It is ActiveState's perl; I think it was not compiled with that parameter. Thank you.

        Which version of AS Perl? It must be pretty ancient, as the last 7 or 8 versions (at least) have been built with large file support. On Win32, anyway; it's easy to forget that they also produce binaries for other OSs.

        If you cannot upgrade for any reason, then I second the idea of using a system utility to read the file and pipe it into your script. I'd probably do it using the 'piped open'. If you need to re-write the data, send it to stdout and redirect the output via the command line.

        die "You didn't redirect the output" if -t STDOUT; open BIGFILE, "cmd/c type \path\to\bigfile |" or die $!; while( <BIGFILE> ) { ## do stuff } close BIGFILE; __END__ script bigfile.dat > modified.dat

        Dying if STDOUT hasn't been re-directed is a touch that you'll appreciate after the first time you print a huge binary file to the console by accident. The bells! The bells! :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Iterating through HUGE FILES
by Animator (Hermit) on May 10, 2005 at 17:14 UTC

    Well, are there newlines in the file? (Or to be precise: does the value of the input record separator appear somewhere in the file?)

    If there aren't, then that's your problem. What you can do then is either set the input record separator ($/) to another character, or set it to a reference to an integer, in which case that many bytes will be read at a time. (For example, after $/ = \123; each <OUT> will read 123 bytes.)
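    A minimal sketch of that fixed-size-record approach (the 1 MB chunk size and the file name are just examples):

    use strict;
    use warnings;

    my $chunk_size = 1024 * 1024;    # 1 MB
    $/ = \$chunk_size;               # a reference to an integer makes <> read fixed-size blocks

    open my $in, '<', 'file.txt' or die "$!\n";
    while ( my $chunk = <$in> ) {
        # do stuff here; $chunk holds up to 1 MB of raw data
    }
    close $in;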

      To add to/support what Animator said, I use the $/=\123 trick regularly at work. The implementation reads in 2M worth of data, processes it, seeks back 1k, and reads in another 2M chunk. The code seeks back 1k because the processing involves regular expressions and we want to catch matches that straddle the 2M boundary. If you have variable-length records, this may not work so well, as it is very likely the end of the read-in buffer will fall in the middle of a record.
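      A rough sketch of that overlapping-read pattern (the 2 MB and 1 KB sizes come from the description above; the file name is a placeholder):

      use strict;
      use warnings;
      use Fcntl qw(SEEK_CUR);

      my $chunk_size = 2 * 1024 * 1024;    # read 2 MB at a time
      my $overlap    = 1024;               # seek back 1 KB so matches can straddle chunk boundaries

      open my $fh, '<', 'bigfile.dat' or die "$!\n";
      while ( read( $fh, my $buffer, $chunk_size ) ) {
          # run the regular expressions against $buffer here
          # (note: a match that falls entirely inside the overlap region may be seen twice)

          last if eof($fh);
          seek( $fh, -$overlap, SEEK_CUR ) or die "seek failed: $!";
      }
      close $fh;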

Re: Iterating through HUGE FILES
by dave_the_m (Monsignor) on May 10, 2005 at 16:39 UTC
    It was my belief that when you do the while bit here, it does not open the whole file. The result is it HANGS
    The code above looks okay. Processing 300Gb of anything is going to take quite some time. Are you sure it's hanging rather than just taking a long time?

    Dave.

Re: Iterating through HUGE FILES
by dynamo (Chaplain) on May 10, 2005 at 16:57 UTC
    I had a very similar problem working with a HUGE text file myself once. It was about 50 gigs and I ran into a limit that felt like something compiled into perl or my system libraries - I tried all sorts of work-arounds to try to keep the processing incremental.

    In the end, I found my solution in using an external shell program to send the lines to perl one at a time. It was slower than it probably would have been running the loop in perl, but it worked. Also check whether you get any relief from your problem if you pipe the input in through cat or similar; piping your input makes it impossible to seek within the file, and I believe that perl treats it differently.

    Sorry for all the hand-waving, but when you are having bugs that shouldn't occur, you have to be willing to try solutions that shouldn't work.

Re: Iterating through HUGE FILES
by ikegami (Patriarch) on May 10, 2005 at 16:44 UTC

    What makes you think it's hanging and not taking a long time? What makes you think the problem isn't with "do stuff here"?

    By the way, I'm curious as to why you named the input file handle "OUT".

Re: Iterating through HUGE FILES
by gellyfish (Monsignor) on May 10, 2005 at 16:47 UTC

    Add a print '.'; as the first thing in the while to see the progress. Of course, if it is a small number of very large lines rather than a very large number of small lines, then you might not see much happening.

    /J\

      Make sure you set $| = 1 beforehand, though (or print to STDERR instead).

      Also useful is something like print STDERR '.' if $. % 100 == 0 to get biff'd every 100 lines.
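      Putting those two suggestions together, a small sketch (the file name, handle name, and the 100-line interval are just the examples from this thread):

      use strict;
      use warnings;

      $| = 1;    # unbuffer STDOUT so the progress dots appear immediately
                 # (printing the dots to STDERR instead would avoid the need for $|)

      open OUT, "file.txt" or die "$!\n";
      while (<OUT>) {
          print '.' if $. % 100 == 0;    # one dot per 100 lines read ($. is the input line counter)
          # do stuff here
      }
      close OUT;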

Re: Iterating through HUGE FILES
by sh1tn (Priest) on May 10, 2005 at 17:42 UTC
    You can check that the reading process has not stopped
    by printing every (for example) 1000th line:
    while ( <OUT> ) {
        $. % 1000 or print "line $.\n";
        # do stuff here
    }


Re: Iterating through HUGE FILES
by Adrade (Pilgrim) on May 11, 2005 at 00:54 UTC
    Dear John,

    You may want to consider using the sysread() call. Although I can't say for sure, this might solve your problem...
    sysopen(FILE, $filename, 0);
    while (sysread(FILE, $buffer, 10240)) {
        print $buffer;
    }
    close(FILE);
    I hope it helps!
      -Adam
Re: Iterating through HUGE FILES
by sk (Curate) on May 11, 2005 at 04:23 UTC
    I created a dummy ~5GB file.
    perl -le 'BEGIN{$,=","} print map int rand 1000, 1..2500 for 1..547_183' > infile.csv
    [sk]% time wc -l infile.csv
    547183 numbers.csv
    2.730u 11.660s 1:32.06 15.6%
    [sk]% time perl -nle '$line++; print +($line-1) if eof;' infile.csv
    547183
    19.600u 4.560s 0:24.16 100.0%

    Agreed 300GB is freaking large! But Perl was able to read this 5GB file very fast. I don't see a huge problem just reading a 300GB file.

    It will be hard for us to identify where the program is stalling without looking at the "do stuff here" block. For example, if the file you are reading in is a CSV file and you parse it into a HUGE list, that will slow down your process. Thinking ahead and designing the right input file for processing will solve runtime issues. For example, if you need only certain portions of each line, you might want to trim down the input file separately before you start your "core" process, as in the sketch below.
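    A small pre-pass along those lines might look like this (the column numbers and file names are purely hypothetical):

    use strict;
    use warnings;

    # Hypothetical pre-pass: keep only columns 0 and 3 of a comma-separated file,
    # writing a much smaller file for the "core" process to read.
    open my $in,  '<', 'infile.csv'  or die "$!\n";
    open my $out, '>', 'trimmed.csv' or die "$!\n";
    while ( my $line = <$in> ) {
        chomp $line;
        my @fields = split /,/, $line;
        print {$out} join( ',', @fields[ 0, 3 ] ), "\n";
    }
    close $in;
    close $out;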

    Also have you tried running this script on a smaller file?

    % head -10000 inputfile > smallfile
    % script smallfile

    See if this completes. If it does, then there is some issue with the large file.

    Are there lines inside your while block that do not have to be processed for every record?

    cheers

    SK

    PS: Just curious, what kind of application requires a 300GB file? How do you manage such large files? The very thought of backing it up scares me :)

Re: Iterating through HUGE FILES
by smullis (Pilgrim) on May 12, 2005 at 10:57 UTC

    Hello there,


    I'm surprised no one else has mentioned this, but you should probably check out Tie::File. (Maybe it has been mentioned and I didn't notice... but anyway.)


    ...lifted straight from the module docs...

    # This file documents Tie::File version 0.96
    use Tie::File;

    tie @array, 'Tie::File', filename or die ...;

    $array[13] = 'blah';      # line 13 of the file is now 'blah'
    print $array[42];         # display line 42 of the file

    $n_recs = @array;         # how many records are in the file?
    $#array -= 2;             # chop two records off the end

    for (@array) {
        s/PERL/Perl/g;        # Replace PERL with Perl everywhere in the file
    }

    # These are just like regular push, pop, unshift, shift, and splice
    # Except that they modify the file in the way you would expect

    push @array, new recs...;
    my $r1 = pop @array;
    unshift @array, new recs...;
    my $r2 = shift @array;
    @old_recs = splice @array, 3, 7, new recs...;

    untie @array;             # all finished

    Cheers
    SM