in reply to Windows 7 Remove Tabs Out of Memory
You are using File::Slurp wrongly. (For a file of this size!)
When you call my $s = read_file( $filename );, it first reads the entire 500MB into an internal scalar, and then it returns it to you, at which point the return value is assigned to a second scalar in your context.
You now have 2 copies of the data in memory: 1GB! And you haven't done anything with it yet.
You then run your regex on it, which takes around half a second on my machine and causes no memory growth.
Then you pass your copy of the data into write_file(), which means it gets copied onto the stack.
You now have 3 copies of the data in memory: 1.5GB!
And internally to write_file(), it gets copied again. You now have 4 copies of the data in memory: 2GB!
And if you are on a 32-bit Perl, you've blown your heap and you get the very "Out of memory!" error from your title.
And if you are on a 64-bit perl with enough memory, it then spends an inordinate amount of time(**) futzing with the copied data, "fixing up" that which isn't broken. Dog knows why it does this. It doesn't need to. Just typical O'Woe over-engineering!
More than 25 minutes, indeed over 2 hours(**) (before I ^C'd it), to write 500MB of data to disk is ridiculous!
(**For a job that can be completed in 8 seconds simply, without trickery, 2 hours is as close to 'Never completes' as makes no difference.)
File::Slurp goes to (extraordinary) lengths in an attempt to "be efficient". (It fails miserably, but I'll get back to that!).
When reading the file, you can avoid one copy of the data by requesting that the module return a reference to it, thus avoiding the copy made by the return.
And when writing the file, you can pass that reference back. The module will (for no good reason) still copy the data internally before writing it out, but you do save another copy:
```perl
#! perl -slw
use strict;
use File::Slurp;
use Time::HiRes qw[ time ];

print STDERR time;
my $s = read_file( $ARGV[0], scalar_ref => 1 );
print STDERR time;
$$s =~ s/\t/ /g;
print STDERR time;
write_file( $ARGV[1], $s );
print STDERR time;
__END__
```

```
[22:14:07.81] C:\test>984648-2 500MB.csv junk.txt
1343769390.96321
1343769394.24913
1343769394.70982
Terminating on signal SIGINT(2)
```
This way, you only have one redundant copy of the data in memory, for a saving of 1GB. Your process won't run out of memory.
However, it will still take more than 25 minutes, indeed over 2 hours (I didn't wait any longer), to actually write 500MB to disk!
How about we try the same thing without the assistance of any overhyped, over-engineered, overblown modules.
```perl
#! perl -slw
use strict;
use Time::HiRes qw[ time ];

print STDERR time;
my $s;
do {
    local( @ARGV, $/ ) = $ARGV[0];
    $s = <>;
};
print STDERR time;
$s =~ tr[\t][ ];
print STDERR time;
open O, '>', $ARGV[1] or die $!;
{
    local $\;
    print( O $s );
}
close O;
print STDERR time;
__END__
```

```
[ 0:57:20.47] C:\test>984648-3 500MB.csv junk.txt
1343779056.03211
1343779058.22142
1343779058.70098
1343779061.99852
[ 0:57:42.05] C:\test>
```
That's efficient!
Bottom line: When you consider using a module for something -- LOOK INSIDE! If it looks too complicated for what it does, it probably is.
Replies are listed 'Best First'.

Re^2: Windows 7 Remove Tabs Out of Memory
  by Anonymous Monk on Aug 01, 2012 at 08:45 UTC
  by BrowserUk (Patriarch) on Aug 01, 2012 at 09:20 UTC
  by Anonymous Monk on Aug 01, 2012 at 13:09 UTC
  by BrowserUk (Patriarch) on Aug 01, 2012 at 13:43 UTC
  by moritz (Cardinal) on Aug 01, 2012 at 08:55 UTC