siva kumar has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have a built-in function that filters out unwanted JavaScript, certain words, etc.; I just pass it a variable reference and it returns the filtered content.
Here I have a 2GB file (a database dump) to be filtered. Reading the 2GB file in one go gives me an "out of memory" error, so I am splitting it into several 10MB files using the Linux "split" command.
After getting the 10MB files, I loop through each one, read its content, pass the content to the filter, and get the filtered content back. The filtered content is appended to a variable.
This process still takes considerable time. Please advise me whether I am doing this the right way, or whether there is a better solution.
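The pipeline described above can be sketched roughly as follows; `filter_content` is a hypothetical stand-in for the unnamed built-in filter (here it just strips `<script>` blocks), and `part.*` is an assumed naming for the pieces produced by "split":

```perl
use strict;
use warnings;

# Hypothetical stand-in for the built-in filter: strips
# <script>...</script> blocks from the text it is given.
sub filter_content {
    my ($text) = @_;
    $text =~ s{<script\b.*?</script>}{}gis;
    return $text;
}

my $filtered = '';
for my $part (glob 'part.*') {                # the 10MB pieces from "split"
    open my $fh, '<', $part or die "Cannot open $part: $!";
    my $content = do { local $/; <$fh> };     # slurp one piece
    close $fh;
    $filtered .= filter_content($content);    # grow one big variable
}
```

Note that appending everything to one scalar means `$filtered` eventually holds the entire 2GB in memory again, which is exactly the problem the split was meant to avoid.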
Thanks,
Sivakumar
Re: Split 2GB file and parsing
by GrandFather (Saint) on Nov 01, 2007 at 09:20 UTC
What do you do with the processed data? If you are simply writing it back out as a new file (or replacing the original file), then you should read it with your Perl script a line at a time, process the line, then write it out. There are a couple of advantages in doing that. First, you don't have to read and write the data multiple times. Second, you only read a (presumably) small amount of data at a time, so the total memory used by your script remains small and doesn't start beating on the virtual memory system.
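A minimal sketch of that line-at-a-time approach; `filter_line` is a hypothetical stand-in for the poster's filter (here it just strips single-line `<script>` runs):

```perl
use strict;
use warnings;

# Hypothetical per-line filter (a stand-in for the poster's function);
# here it removes <script>...</script> runs that fit on one line.
sub filter_line {
    my ($line) = @_;
    $line =~ s{<script\b.*?</script>}{}gi;
    return $line;
}

# Read each file named on the command line a line at a time, filter,
# and write the result straight out, e.g.:
#   perl filter.pl dump.sql > dump.filtered.sql
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        print filter_line($line);
    }
    close $in;
}
```

Only one line is ever held in memory, so this works the same on a 2GB file as on a 2KB one.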
It may help if you show us the code that you are using.
Perl is environmentally friendly - it saves trees
Re: Split 2GB file and parsing
by andreas1234567 (Vicar) on Nov 01, 2007 at 09:39 UTC
My old desktop parses 4.3 GB in less than 2 minutes (without doing much useful work, that is), so I guess there is optimization potential. That process uses less than 0.1% of memory.
$ time perl -w
use strict;
my $FH = undef;
open($FH, "<", "a.dvd.iso") or die "bah";
while (<$FH>) {
    print "hoho" if m/^ho/;
}
close $FH or die "doh";
__END__
hohohohohohohohohohohohohoh...
real 1m40.487s
user 0m32.321s
sys 0m5.729s
$ ls -lh a.dvd.iso
-rw-r--r-- 1 user user 4.3G Oct 19 07:51 a.dvd.iso
Re: Split 2GB file and parsing
by northwind (Hermit) on Nov 01, 2007 at 11:54 UTC
OK, first: what happens when something you want filtered is split across a file boundary? IMHO (based on the fact that I'm shooting in the dark, because you did not post any relevant code) it would be much better to:
1. Set $/ = \5242880 (that's 5MB).
2. Open the original 2GB file.
3. Do a single read (something like my $data = <BIGFILE>; this would also likely be part of a while loop, due to step 7).
4. Process the data.
5. seek back x KB (where x is typically 1 to 2KB; this prevents you from missing something that straddles the read boundary).
6. Read in another chunk.
7. Repeat from step 4 until the entire file is processed.
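The steps above can be sketched as a small helper; the chunk size, overlap, and callback are all assumptions for illustration. Note the eof check before seeking: seek clears the handle's eof state, so without it the loop would never terminate.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

# Read $file in fixed-size chunks, seeking back $overlap bytes between
# reads so a pattern straddling a chunk boundary is still seen (it will
# appear twice, once in each adjacent chunk, so the filter must tolerate
# reprocessing the overlap).
sub read_in_chunks {
    my ($file, $chunk_size, $overlap, $callback) = @_;
    local $/ = \$chunk_size;               # fixed-length record reads
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $chunk = <$fh>) {
        $callback->($chunk);
        last if eof $fh;                   # seek would reset eof: stop first
        seek $fh, -$overlap, SEEK_CUR or die "seek failed: $!";
    }
    close $fh;
}

# e.g. 5MB chunks with a 2KB overlap:
#   read_in_chunks('dump.sql', 5_242_880, 2048, sub { ... filter $_[0] ... });
```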
As far as your comment about the process "taking considerable time" goes, we (the Monastery) cannot advise you on whether you are doing the right thing or not unless we see actual code.
Re: Split 2GB file and parsing
by clueless newbie (Curate) on Nov 01, 2007 at 17:11 UTC
May a clueless newbie point out that he's reading a dump of a database?
Perhaps it might be better to process the data one database record at a time?
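Perl makes record-at-a-time reading easy via the input record separator. Assuming a SQL-style dump where each statement ends in ";\n" (the actual dump format is not stated in the thread, so this separator is an assumption), a sketch might look like:

```perl
use strict;
use warnings;

# Assumed record terminator for a SQL-style dump; adjust to match the
# actual dump format.
sub for_each_record {
    my ($file, $callback) = @_;
    local $/ = ";\n";                      # one statement per read
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $record = <$fh>) {
        $callback->($record);              # filter one record at a time
    }
    close $fh;
}
```

This sidesteps the boundary-straddling problem entirely, because each read ends exactly on a record boundary.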