Muy Large File

BuddhaLovesPerl has asked for the wisdom of the Perl Monks concerning the following question:

Re: Muy Large File by BrowserUk (Patriarch) on Mar 14, 2005 at 09:01 UTC
... this is taking over 4 hours....

You're doing something wrong :). The following shows Perl processing a 32 GB file in place, finding and replacing 30% of its contents in under 25 minutes, on a single-CPU machine with 512 MB of RAM (the process itself only uses 3 MB of RAM).

    #! perl -slw
    use strict;

    our $BUFSIZE ||= 2**20;

    open my $fh, '+< :raw', $ARGV[ 0 ] or die $!;

    while( sysread $fh, $_, $BUFSIZE ) {
        tr[123][123];
        sysseek $fh, -length(), 1;   ## Updated per Dave_the_m's correction below ++
        syswrite $fh, $_;
    }

    close $fh;

    __DATA__
    [ 8:31:52.64] P:\test>439181 data\integers.dat

    [ 8:54:43.92] P:\test>dir data\integers.dat
     Volume in drive P has no label.
     Volume Serial Number is BCCA-B4CC

     Directory of P:\test\data

    14/03/2005  08:54    34,359,738,368 integers.dat
                   1 File(s) 34,359,738,368 bytes

Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Re^2: Muy Large File by dave_the_m (Monsignor) on Mar 14, 2005 at 10:53 UTC
One slight nit: if the file size isn't a multiple of the buffer size, the final seek will seek back too far and corrupt the final block.

`while(sysread $fh, $_, $BUFSIZE ) { tr[123][123]; sysseek $fh, -length(), 1; syswrite $fh, $_; }`

(untested).

Dave.
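To see the short-final-block case in action, here is a minimal sketch that runs the same read, seek back, write loop over a small scratch file whose size is deliberately not a multiple of the buffer size. The sub name `translate_in_place`, the scratch file from File::Temp, and the `tr/123/abc/` mapping are assumptions made for illustration; they are not taken from the posts above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp qw(tempfile);

# Hypothetical helper: transliterate a file in place, one buffer at a time.
sub translate_in_place {
    my ($path, $bufsize) = @_;
    open my $fh, '+<:raw', $path or die "open $path: $!";
    my $buf;
    while (my $got = sysread $fh, $buf, $bufsize) {
        $buf =~ tr/123/abc/;                 # illustrative mapping only
        # Seek back by the bytes actually read, not by $bufsize, so a
        # short final block is rewritten at the right offset.
        sysseek $fh, -$got, SEEK_CUR or die "sysseek: $!";
        my $wrote = syswrite $fh, $buf;
        die "short write: $!" unless defined $wrote and $wrote == $got;
    }
    close $fh or die "close: $!";
    return;
}

# 10 bytes with a 4-byte buffer: the last block is only 2 bytes long.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} '1231231231';
close $tmp or die "close: $!";

translate_in_place($path, 4);

open my $in, '<:raw', $path or die "open $path: $!";
my $result = do { local $/; <$in> };
close $in;
print "$result\n";    # prints abcabcabca
```

Seeking back by the return value of `sysread` has the same effect as `-length()` in the loop above, since `$_` holds exactly the bytes that were read.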
Re^2: Muy Large File by gam3 (Curate) on Apr 12, 2005 at 17:36 UTC
Another problem is with data that crosses the buffer boundary.

-- gam3
A picture is worth a thousand words, but takes 200K.
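A single-byte `tr` is unaffected by where the buffer ends, but a multi-byte pattern can be split across two reads and missed. The sketch below illustrates one common workaround under stated assumptions: the replacement has the same length as the pattern, and each window starts a few bytes early (pattern length minus one) so that a straddling match is seen whole. It works on an in-memory string rather than a file to keep it short; `replace_chunked` and its parameters are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every $from with the equal-length $to in
# $data, walking the data in $bufsize-byte steps as a buffered pass would.
sub replace_chunked {
    my ($data, $from, $to, $bufsize, $overlap) = @_;
    die "replacement must keep the length"
        unless length($from) == length($to);
    my $pos = 0;
    while ($pos < length($data)) {
        # Start each window $overlap bytes early so a match split across
        # two chunks is seen whole in the later window.
        my $start = $pos >= $overlap ? $pos - $overlap : 0;
        my $len   = $pos + $bufsize - $start;
        my $chunk = substr $data, $start, $len;
        $chunk =~ s/\Q$from\E/$to/g;
        substr($data, $start, length($chunk)) = $chunk;
        $pos += $bufsize;
    }
    return $data;
}

# 'foo' sits at offsets 3..5, straddling the 4-byte chunk boundary.
my $naive   = replace_chunked('xxxfooxxx', 'foo', 'bar', 4, 0);
my $careful = replace_chunked('xxxfooxxx', 'foo', 'bar', 4, 2);
print "$naive\n";      # prints xxxfooxxx
print "$careful\n";    # prints xxxbarxxx
```

One caveat of this approach: if the replacement text can combine with neighbouring bytes to form a new match, re-scanning the overlap may replace more than a single pass over the whole data would.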
Re: Muy Large File by perlfan (Vicar) on Mar 14, 2005 at 19:24 UTC
I have a single file that ranges daily from 45-50Gig on a Solaris 8 server with 16G ram and 8 900Mhz cpu's.

How can you not be taking advantage of that horsepower? I would seriously look into ROMIO, MPICH's implementation of MPI I/O:

http://www-unix.mcs.anl.gov/romio/
(MPI Standard 2.0) http://www.mpi-forum.org/docs/mpi-20-html/node171.htm#Node171

If it has to be Perl, then I would certainly look into parallelizing this application. As a brute-force approach, split the file 8 ways, run a process to take care of each piece, then join the darn things back together. Even with the splitting and rejoining, I am sure it would be faster than what is happening right now.
Re^2: Muy Large File by crenz (Priest) on Mar 15, 2005 at 11:41 UTC
Since he needs it to be in place, I suggest 8 processes that work on the same file at the same time... if that's possible on Solaris. You can use `seek` for that.
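A minimal sketch of that idea, assuming a unixy system with `fork`: each child opens its own handle on the same file, seeks to the start of its byte range, and rewrites only that range. The sub names, the `tr/123/abc/` mapping, and the tiny scratch file are hypothetical; whether this beats a single sequential pass on real hardware is not something the sketch can show.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);
use File::Temp qw(tempfile);
use POSIX ();

# Hypothetical worker: transliterate bytes [$start, $end) of $path in place.
sub process_range {
    my ($path, $start, $end, $bufsize) = @_;
    # Each worker opens its own handle so file offsets are not shared.
    open my $fh, '+<:raw', $path or die "open $path: $!";
    sysseek $fh, $start, SEEK_SET or die "sysseek: $!";
    my $pos = $start;
    while ($pos < $end) {
        my $want = $end - $pos < $bufsize ? $end - $pos : $bufsize;
        my $got  = sysread $fh, my $buf, $want;
        die "sysread: $!" unless defined $got;
        last if $got == 0;
        $buf =~ tr/123/abc/;                  # illustrative mapping only
        sysseek $fh, -$got, SEEK_CUR or die "sysseek: $!";
        my $wrote = syswrite $fh, $buf;
        die "syswrite: $!" unless defined $wrote;
        $pos += $got;
    }
    close $fh or die "close: $!";
    return;
}

# Hypothetical driver: split the file into $workers slices, one child each.
sub process_parallel {
    my ($path, $workers, $bufsize) = @_;
    my $size  = (-s $path) || 0;
    my $slice = int(($size + $workers - 1) / $workers);   # ceiling division
    my @pids;
    for my $i (0 .. $workers - 1) {
        my $start = $i * $slice;
        last if $start >= $size;
        my $end = $start + $slice > $size ? $size : $start + $slice;
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) {                      # child
            my $ok = eval { process_range($path, $start, $end, $bufsize); 1 };
            warn $@ unless $ok;
            POSIX::_exit($ok ? 0 : 1);        # skip END blocks in the child
        }
        push @pids, $pid;
    }
    for my $pid (@pids) {
        waitpid $pid, 0;
        die "worker $pid failed" if $?;
    }
    return;
}

my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} '123' x 1000;                    # 3000 bytes
close $tmp or die "close: $!";

process_parallel($path, 8, 64);

open my $in, '<:raw', $path or die "open $path: $!";
my $result = do { local $/; <$in> };
close $in;
print $result eq ('abc' x 1000) ? "ok\n" : "mismatch\n";
```

Because `tr` works byte by byte, the slice boundaries can fall anywhere. A multi-byte pattern would need the slices aligned to record boundaries instead.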
Re: Muy Large File by TilRMan (Friar) on Mar 14, 2005 at 10:33 UTC
Update: As thor points out, my solution here is quite wrong. Sorry.

On unixy platforms, Perl has exactly what you need already built in. (Windows users may need a `binmode` somewhere.)

    #!/usr/bin/perl -w -pi
    use strict;
    tr/A-Z/ /;

If your records do not have newlines at the end, then you will need to set the record length. Add the following line to the script, replacing 4096 with the record length:

`BEGIN { $/ = \4096 }`

The magic is in the -pi, which turns on in-place editing in a loop (more at perlrun). Then the `tr///` operator runs on every record, replacing the offending characters with spaces (more at perlop). Use `tr///` instead of `s///`; it's probably faster and safer.

Note: Code is mostly untested. Use with caution.
Re^2: Muy Large File by thor (Priest) on Mar 14, 2005 at 13:28 UTC
The problem is that while the -i switch is the "in place" switch, it isn't really in place. From perldoc perlrun:

"It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print() statements."

As per the OP, the file is too large to do this with.

thor

`Feel the white light, the light within Be your own disciple, fan the sparks of will For all of us waiting, your kingdom will come`
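The renaming can be observed directly. The following is a small sketch, assuming a unixy filesystem where `stat` reports meaningful inode numbers: it sets `$^I` (the in-script equivalent of `-i`) with a `.bak` extension and compares inodes before and after. The scratch directory and file name are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/data.txt";               # hypothetical scratch file

open my $out, '>', $path or die "open $path: $!";
print {$out} "ABC def\n";
close $out or die "close: $!";

my $inode_before = (stat $path)[1];

{
    # Equivalent of running with -pi.bak on this one file.
    local ($^I, @ARGV) = ('.bak', $path);
    while (<>) {
        tr/A-Z/ /;
        print;                            # goes to the new copy of the file
    }
}

my $inode_after  = (stat $path)[1];
my $inode_backup = (stat "$path.bak")[1];

print "backup keeps the original inode: ",
    ($inode_backup == $inode_before ? "yes" : "no"), "\n";
print "edited file is a new inode:      ",
    ($inode_after != $inode_before ? "yes" : "no"), "\n";
```

Both copies exist on disk at once, which is why the approach needs roughly twice the file's size in free space.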
Re: Muy Large File by Anonymous Monk on Mar 15, 2005 at 20:27 UTC
Are other processes running at the same time as you do your conversion? If there's a set of constant-HD-use programs running, the HD will have to seek for many of the block reads/writes, and its built-in cache won't be anywhere near as effective.

So, my first guess? Try shutting down all services and kicking all users, if that's amenable to the wraith types. Run it for an hour and see how far through it gets.

Even if this isn't the case, be cautious with parallel processes; HD misses are on the order of milliseconds iirc, which means that at some number of processes the bottleneck is going to come from HD access, not from CPU time and RAM accesses.

To check: use a clock-tick timer and time a read of one (non-cached, and make sure it seeks!) block off the HD. Ditto for a write (they should be almost the same, though if it caches the write, e.g. doesn't use write-through, it could be longer). Then time a search/replace. Betcha the latter is faster.
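A rough sketch of that timing comparison, using Time::HiRes for sub-second resolution. Note the big assumption: the scratch file here has just been written, so the read will almost certainly be served from the OS cache rather than from the platters. To measure what the post describes, point the read at a block of a large file that has not been touched recently. The 1 MiB block size and the `tr/123/abc/` mapping are arbitrary.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

my $BLOCK = 2**20;    # 1 MiB, an arbitrary illustrative block size

# Scratch file slightly larger than one block.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} '1234567890' x 104_858;
close $tmp or die "close: $!";

# Time one block read.
open my $fh, '<:raw', $path or die "open $path: $!";
my $t0  = time();
my $got = sysread $fh, my $buf, $BLOCK;
my $read_secs = time() - $t0;
die "sysread: $!" unless defined $got;
close $fh;

# Time the search/replace on the same block.
$t0 = time();
my $changed = ($buf =~ tr/123/abc/);      # tr returns the number of hits
my $tr_secs = time() - $t0;

printf "read %d bytes in %.6fs; tr changed %d bytes in %.6fs\n",
    $got, $read_secs, $changed, $tr_secs;
```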
Re^2: Muy Large File by BuddhaLovesPerl (Sexton) on Mar 16, 2005 at 09:25 UTC
Wow. Many deep bows of reverence for all that responded. As UK implied, I was (indeed) doing something wrong. Based on the above suggestions, this was the script tested:

    #!/usr/local/perl5.6.1/bin/perl -slw
    use strict;

    our $BUFSIZE ||= 2**30;

    open my $fhi, '+<', "/data/p_dm200/ndm_ip_pull/test_customer1" or die $!;

    while( sysread $fhi, $_, $BUFSIZE ) {
        tr^M ;
        sysseek $fhi, -length(), 1;
        syswrite $fhi, $_, $BUFSIZE;
    }

    close $fhi;

which was tested against an 8,595,447,728 byte file. The time output was:

    real 10m5.95s
    user 1m48.55s
    sys  0m17.24s

An amazing 10 minutes. I checked the output and it looks exactly as expected. I even retested 3 times and each time the results were similar.

OK, now I am getting greedy and curious as to whether this can be optimized more. I ran top during this session and saw that SIZE and RES were both around 1026M throughout the duration and only 1 CPU seemed to be used. Would increasing BUFSIZE help performance linearly? If I was capable (and I am not), would either shared-memory threads or parallel forks produce big gains? Any other low-hanging fruit?

Perlfan, the ROMIO seemed interesting but I could not find a Perl sample. Still it seemed interesting.

Anonymous Monk, please forgive my ignorance, but what does HD mean?

A sincere thanks to all,
--Paul
Re^3: Muy Large File by BrowserUk (Patriarch) on Mar 16, 2005 at 11:12 UTC
Using a larger buffer size may increase throughput slightly, but then it may not. It will depend upon many factors, mostly to do with your file system buffering, disk subsystems etc. The easy answer, given it's only taking 10 minutes, is to try it.

As far as using threads to distribute the load across your processors is concerned, it certainly could be done, and could, in theory, give you near-linear reductions per extra processor. But, and it's a big 'but', how good is your runtime library's handling of multi-threaded IO to a single file?

On my system, even using sys* IO calls and careful locking to ensure that only one thread can seek&read or seek&write at a time, something, somewhere is getting confused and the file is corrupted. I suspect that even after a syswrite completes, the data is not yet fully flushed to disk before the next seek&write cycle starts.

So, maybe you could get it to work on your system, but I haven't succeeded on mine, and I am not yet entirely sure whether the problem lies within Perl, the OS, or some combination of the two.

If you feel like trying this, please come back and report your findings. If you need a starting point, /msg me and I can let you have my failing code.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
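"Try it" can be as simple as timing the same in-place pass with a few buffer sizes. The sketch below does that on a scratch file of about 4 MiB, which is an assumption made to keep it quick; a file that small lives entirely in the OS cache, so the numbers say little about a multi-gigabyte file on a busy disk. `timed_pass` and the `tr/12/21/` mapping are hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

# Hypothetical helper: one in-place pass over $path with the given buffer
# size; returns the bytes processed and the elapsed wall-clock seconds.
sub timed_pass {
    my ($path, $bufsize) = @_;
    open my $fh, '+<:raw', $path or die "open $path: $!";
    my ($buf, $total) = ('', 0);
    my $t0 = time();
    while (my $got = sysread $fh, $buf, $bufsize) {
        $buf =~ tr/12/21/;                    # illustrative mapping only
        sysseek $fh, -$got, SEEK_CUR or die "sysseek: $!";
        my $wrote = syswrite $fh, $buf;
        die "syswrite: $!" unless defined $wrote;
        $total += $got;
    }
    close $fh or die "close: $!";
    return ($total, time() - $t0);
}

# Scratch file of 4,194,310 bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} '1234567890' x 419_431;
close $tmp or die "close: $!";
my $size = -s $path;

my %bytes_for;
for my $bufsize (2**12, 2**16, 2**20) {
    my ($total, $secs) = timed_pass($path, $bufsize);
    $bytes_for{$bufsize} = $total;
    printf "BUFSIZE %8d: %d bytes in %.4fs\n", $bufsize, $total, $secs;
}
```

On the real file, the same loop could be run with the `-BUFSIZE=...` switch that `-s` enables in the scripts above, one run per size.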
Re^3: Muy Large File by Random_Walk (Prior) on Mar 16, 2005 at 11:36 UTC
By HD, anonymonk means Hard Disk. The seek times to move the heads around a hard drive are slow compared to memory access and geological compared to processor cache. What this means is that processes dedicated to doing something to a file are normally disk I/O bound. Let's not mention network latencies for now.

If you do manage to split this into threads you may actually reduce performance, as each time a different thread gets a shot at it, it forces the HD to drag its heads over to a completely different part of the disk. A single thread reading the file sequentially will not be making the heads seek so much, assuming the file is not desperately fragmented on the media. Then there are other users competing for those heads and tasking them off to the boondocks of the drive as far as your data is concerned, which is why it was suggested you kick the lusers to try and get the disk all to yourself.

Cheers,
R.

Pereant, qui ante nos nostra dixerunt!
Re^4: Muy Large File by Anonymous Monk on Mar 16, 2005 at 12:06 UTC
Re^5: Muy Large File by Random_Walk (Prior) on Mar 16, 2005 at 15:21 UTC
Re^4: Muy Large File by BuddhaLovesPerl (Sexton) on Mar 19, 2005 at 01:41 UTC
Re^5: Muy Large File by BrowserUk (Patriarch) on Mar 19, 2005 at 07:03 UTC
Some notes below your chosen depth have not been shown here
Re^5: Muy Large File by tlm (Prior) on Mar 20, 2005 at 14:42 UTC

