knewter has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in which a large chunk of identical text occurs multiple times, and I'd like to remove it. I've written the following, which I think would work if not for the fact that it sucks up 1.2GB of RAM in about 6 minutes and then crashes (WinXP):

#!/usr/bin/perl
use strict;

my $a;
my $content;

open E, "<C:/auburn_courses_replace.txt" or die "Unable to open file $!";
undef $/;
$a = <E>;
$/ = "\n";
close E;

open F, "<C:/auburn_courses.txt" or die "Unable to open file $!";
undef $/;
$content = <F>;
$/ = "\n";
close F;

open G, ">>C:/auburn_courses2.txt" or die "Unable to open file $!";
#print $content;
#print $a;
$content =~ s/$a//g;
print G $content;
close G;
Anyone know how to do what I'm trying to do and succeed? :)

Update! All fixed, ignore me

Replies are listed 'Best First'.
Re: Running out of memory...
by BrowserUk (Patriarch) on Jan 29, 2005 at 19:20 UTC

    Short answer--don't slurp the entire file.

    Longer answer. If the text you are removing extends across multiple lines, you may be able to use paragraph mode ($/ = "";) to process the file in smaller chunks.
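    For instance, a minimal sketch of paragraph-mode processing, assuming the unwanted block is bracketed by blank lines (the filenames come from the post above; /unwanted text/ is a placeholder pattern, not the OP's actual data):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # With $/ set to the empty string, <> returns one blank-line-separated
        # paragraph at a time instead of slurping the whole file.
        $/ = '';

        open my $in,  '<', 'C:/auburn_courses.txt'  or die "Unable to open file: $!";
        open my $out, '>', 'C:/auburn_courses2.txt' or die "Unable to open file: $!";

        while ( my $para = <$in> ) {
            next if $para =~ /unwanted text/;   # drop paragraphs that match
            print $out $para;
        }

        close $in;
        close $out;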

    If not, then you will need to use something like the sliding buffer technique demonstrated in Re: split and sysread() and towards the end of Optimising processing for large data files.

    There are also other discussions of the "sliding buffer" technique on this site--try a super search for that term.
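    For a rough idea of the shape of that technique, here is a sketch only: the chunk size is arbitrary, 'TEXT TO REMOVE' is a placeholder, and it assumes the text being removed is shorter than one chunk:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Keep a window of two chunks in memory so a match can straddle a
        # chunk boundary; write out everything except the trailing chunk,
        # then read more.
        my $chunk    = 64 * 1024;                    # arbitrary, for illustration
        my $unwanted = quotemeta 'TEXT TO REMOVE';   # placeholder pattern

        open my $in,  '<', $ARGV[0] or die "$ARGV[0]: $!";
        open my $out, '>', $ARGV[1] or die "$ARGV[1]: $!";

        my $buffer = '';
        while ( read $in, $buffer, $chunk, length $buffer ) {
            $buffer =~ s/$unwanted//g;
            # Write and remove everything except the last chunk of the window.
            print $out substr( $buffer, 0, -$chunk, '' )
                if length( $buffer ) > $chunk;
        }
        print $out $buffer;    # flush what is left after the final read

        close $in;
        close $out;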


    Examine what is said, not who speaks.
    Silence betokens consent.
    Love the truth but pardon error.
Re: Running out of memory...
by vek (Prior) on Jan 29, 2005 at 19:59 UTC

    When you do the following...

    undef $/;
    $a = <E>;
    ...you are reading the entire file into memory.

    Try to break the task into smaller pieces. Without knowing what auburn_courses.txt or auburn_courses2.txt contain, it's going to be tricky to give specific advice.
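    For contrast, reading a line at a time keeps memory use flat no matter how large the file grows. A sketch, with the per-line processing left open:

        use strict;
        use warnings;

        open my $in, '<', 'C:/auburn_courses.txt' or die "Unable to open file: $!";
        while ( my $line = <$in> ) {
            # process one line at a time here; only $line is ever in memory
        }
        close $in;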

    -- vek --
      I've changed my code to the following:
      #!/usr/local/bin/perl -w
      use strict;

      open IN,  '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
      open OUT, '> :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

      my $a = '<!-- rsecftr.htm - Course Sections Table Footer -->';
      my $b = '<!-- rsechdr.htm - Course Sections and Course Section Search Table Header -->';

      ## Prime the window: the first chunk lands in the second half; the
      ## null padding in front is discarded on the first pass of the loop.
      my $buffer;
      sysread IN, $buffer, 5800, 5800;

      do {
          ## Move the second half of the buffer to the front.
          $buffer = substr( $buffer, 5800 );

          ## and overwrite it with a new chunk
          sysread IN, $buffer, 5800, length( $buffer );

          ## Apply the regex
          $buffer =~ s|$b(.*?)$a||g;

          # print $buffer;   ## debug output, disabled in the hot loop

          ## Write out the first half of the buffer
          syswrite OUT, $buffer, 5800;
      } until eof IN;

      ## Flush whatever remains past the first chunk after the final pass.
      syswrite OUT, substr( $buffer, 5800 ) if length( $buffer ) > 5800;

      close IN;
      close OUT;
      auburn_courses.txt contains a load of HTML files all bunched one after the other. I'd like to remove the bits between the footer of one section that I want to see and the header of the next section that I'd like to see. They're delimited by the $a and $b lines.
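      If the two marker comments each sit on their own line, a line-oriented pass would avoid the buffer arithmetic entirely. A sketch along those lines, assuming the marker lines themselves should be kept and only what falls between them dropped ($footer and $header correspond to the $a and $b above):

          #!/usr/bin/perl
          use strict;
          use warnings;

          my $footer = '<!-- rsecftr.htm - Course Sections Table Footer -->';
          my $header = '<!-- rsechdr.htm - Course Sections and Course Section Search Table Header -->';

          open my $in,  '<', $ARGV[0] or die "$ARGV[0]: $!";
          open my $out, '>', $ARGV[1] or die "$ARGV[1]: $!";

          my $skipping = 0;
          while ( my $line = <$in> ) {
              if ( index( $line, $footer ) >= 0 ) {   # footer ends a wanted section
                  print $out $line;
                  $skipping = 1;
                  next;
              }
              $skipping = 0 if index( $line, $header ) >= 0;   # header starts the next
              print $out $line unless $skipping;
          }

          close $in;
          close $out;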

      Update! All fixed, ignore me

        I realise you have fixed your problem, but I have to say that without seeing your actual data, 5800 seems a very strange choice of buffer size.


        Examine what is said, not who speaks.
        Silence betokens consent.
        Love the truth but pardon error.