Split 2GB file and parsing

siva kumar has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a built-in function that filters out unwanted JavaScript, words, etc. All I have to do is pass it a reference to a variable, and the function returns the filtered content.

I have a 2GB file (a database dump) that needs to be filtered. Reading the whole 2GB file gives me an "out of memory" error, so I am splitting it into several 10MB files using the Linux "split" command.

After getting the 10MB files, I loop through each file, read its content, pass the content to the filter, and get back the filtered content. The filtered content is appended to a variable.

This process is also taking considerable time. Please advise me whether I am doing this the right way, or whether there is a better solution.

Thanks,
Sivakumar


Re: Split 2GB file and parsing
by GrandFather (Saint) on Nov 01, 2007 at 09:20 UTC

What do you do with the processed data? If you are simply writing it back out as a new file (or replacing the original file), then you should read it with your Perl script a line at a time, process the line, then write it out. There are a couple of advantages in doing that. First you don't have to read and write the data multiple times. Second, you only read a (presumably) small amount of data at a time so the total memory used by your script remains small and doesn't start beating on the virtual memory system.
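A minimal sketch of the line-at-a-time approach described above. The original poster's filter was not shown, so filter_content here is a hypothetical stand-in that takes a string reference and returns the filtered text (it only strips script blocks that open and close on one line); filter_file and the file names are likewise made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's filter: takes a reference to a
# string and returns the filtered content. It only removes <script> blocks
# that start and end within the text it is given.
sub filter_content {
    my ($text_ref) = @_;
    (my $clean = $$text_ref) =~ s{<script\b.*?</script>}{}gis;
    return $clean;
}

# Read $in_path one line at a time and write each filtered line to
# $out_path, so only a single line is held in memory at any moment.
sub filter_file {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot open $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot open $out_path: $!";
    while (my $line = <$in>) {
        print {$out} filter_content(\$line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Example call (file names are placeholders):
# filter_file('dump.sql', 'dump.filtered.sql');
```

Writing each line out as soon as it is filtered, rather than appending everything to one variable, is what keeps the memory use flat. This only works as-is if nothing that needs filtering spans more than one line.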

It may help if you show us the code that you are using.

Perl is environmentally friendly - it saves trees


Re: Split 2GB file and parsing
by andreas1234567 (Vicar) on Nov 01, 2007 at 09:39 UTC

$ time perl -w
use strict;
my $FH = undef;
open($FH, "<", "a.dvd.iso") or die "bah";
while (<$FH>) {
  print "hoho" if m/^ho/;
}
close $FH or die "doh";
__END__
hohohohohohohohohohohohohoh...
real    1m40.487s
user    0m32.321s
sys     0m5.729s
$ ls -lh a.dvd.iso
-rw-r--r--  1 user user 4.3G Oct 19 07:51 a.dvd.iso

--
Andreas


Re: Split 2GB file and parsing
by northwind (Hermit) on Nov 01, 2007 at 11:54 UTC

Ok, first, what happens when something you want filtered is split across a file boundary? IMHO (based on the fact that I'm shooting in the dark because you did not post any relevant code) it would be much better to:

1. Set $/=\5242880 (that's 5MB).
2. Open the original 2GB file.
3. Do a single read (something like my $data = <BIGFILE>; (this would also likely be part of a while loop due to step 7)).
4. Process the data.
5. seek back xKB (where x is typically 1 to 2KB; this prevents you from missing something which straddles the read boundary).
6. Read in another chunk.
7. Repeat from step 4 until the entire file is processed.
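The steps above can be sketched roughly as follows. Since the actual filter is unknown, this sketch only locates matches of a pattern and returns their byte offsets; find_offsets and its parameters are hypothetical names. It assumes no single match is longer than the overlap. A filter that rewrites and outputs the data would additionally have to avoid emitting the overlapping bytes twice, which is not shown here.

```perl
use strict;
use warnings;

# Return the byte offsets at which $pattern matches in $path, reading the
# file in $chunk_size-byte records and re-reading $overlap bytes at each
# chunk boundary. Assumes no match is longer than $overlap bytes.
sub find_offsets {
    my ($path, $pattern, $chunk_size, $overlap) = @_;
    die "overlap must be smaller than chunk size" if $overlap >= $chunk_size;

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/ = \$chunk_size;        # fixed-size records instead of lines

    my %seen;                       # absolute offset => 1, removes duplicates
    while (1) {
        my $start = tell($fh);      # absolute offset of this chunk
        my $data  = <$fh>;
        last unless defined $data;

        while ($data =~ /$pattern/g) {
            $seen{ $start + $-[0] } = 1;   # $-[0] is the match start in $data
        }

        # A short read means the end of the file was reached; seeking back
        # again here would re-read the tail forever.
        last if length($data) < $chunk_size;
        seek($fh, -$overlap, 1) or die "seek failed: $!";   # 1 = SEEK_CUR
    }
    close $fh;
    return sort { $a <=> $b } keys %seen;
}

# Example call (file name is a placeholder):
# my @offsets = find_offsets('dump.sql', qr/<script\b/i, 5 * 1024 * 1024, 2048);
```

Matches that fall inside the overlap are seen twice, once per chunk, which is why the offsets are collected in a hash keyed by absolute position.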

As for your comment about the process "taking considerable time", we (the Monastery) cannot advise you on whether you are doing the right thing or not unless we see actual code.


Re: Split 2GB file and parsing
by clueless newbie (Curate) on Nov 01, 2007 at 17:11 UTC

May a clueless newbie point out that he's reading a dump of a database? Perhaps it might be better to process the data one database record at a time?


Re^2: Split 2GB file and parsing

by Jenda (Abbot) on Nov 01, 2007 at 19:29 UTC

Which might mean one line at a time or virtually anything else. In what format is the dump? Tab separated? CSV? What about newlines in the data? Or is it XML? Or? Or? "Database dump" doesn't really say much more than "file". We are shooting in the dark.

Jenda
Support Denmark!
Defend the free world!
