PerlMonks  

Split 2GB file and parsing

by siva kumar (Pilgrim)
on Nov 01, 2007 at 09:04 UTC ( [id://648452]=perlquestion )

siva kumar has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have an existing function that filters out unwanted JavaScript, words, etc. All I have to do is pass it a reference to a variable, and the function returns the filtered content.

Here I have a 2GB file (a database dump) that needs to be filtered. Reading the whole 2GB file gives me an "out of memory" error, so I am splitting it into several 10MB files using the Linux "split" command.

After getting the 10MB files, I loop through each file, read its content, pass the content to the filter, and get back the filtered content. The filtered content is appended to a variable.

This process also takes considerable time. Please advise me whether I am doing the right thing, or whether there is a better solution.

Thanks,
Sivakumar

Replies are listed 'Best First'.
Re: Split 2GB file and parsing
by GrandFather (Saint) on Nov 01, 2007 at 09:20 UTC

    What do you do with the processed data? If you are simply writing it back out as a new file (or replacing the original file), then you should read it with your Perl script a line at a time, process the line, then write it out. There are a couple of advantages in doing that. First you don't have to read and write the data multiple times. Second, you only read a (presumably) small amount of data at a time so the total memory used by your script remains small and doesn't start beating on the virtual memory system.
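A minimal sketch of the line-at-a-time approach, for illustration only. The original filter was never posted, so filter_content here is a hypothetical stand-in that strips script blocks from a string passed by reference; filter_file and the file paths are likewise assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's filter: takes a reference to a
# string and returns the filtered text. The real function was never shown.
sub filter_content {
    my ($ref) = @_;
    (my $clean = $$ref) =~ s{<script\b.*?</script>}{}gis;
    return $clean;
}

# Read $in_path one line at a time and write each filtered line to
# $out_path, so only a single line is held in memory at once.
sub filter_file {
    my ($in_path, $out_path) = @_;
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";
    while (my $line = <$in>) {
        print {$out} filter_content(\$line);
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl filter.pl dump.sql dump.filtered.sql
filter_file(@ARGV) if @ARGV == 2;
```

Note that this only works if nothing the filter needs to catch spans more than one line; a script block broken across lines would slip through this particular stand-in.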

    It may help if you show us the code that you are using.


    Perl is environmentally friendly - it saves trees
Re: Split 2GB file and parsing
by andreas1234567 (Vicar) on Nov 01, 2007 at 09:39 UTC
    My old desktop parses 4.3 GB in less than 2 minutes (without doing much useful, that is), so I guess there's room for optimization. Less than 0.1% of memory is used by that process.
    $ time perl -w
    use strict;
    my $FH = undef;
    open($FH, "<", "a.dvd.iso") or die "bah";
    while (<$FH>) {
        print "hoho" if m/^ho/;
    }
    close $FH or die "doh";
    __END__
    hohohohohohohohohohohohohoh...
    real    1m40.487s
    user    0m32.321s
    sys     0m5.729s
    $ ls -lh a.dvd.iso
    -rw-r--r-- 1 user user 4.3G Oct 19 07:51 a.dvd.iso
    --
    Andreas
Re: Split 2GB file and parsing
by northwind (Hermit) on Nov 01, 2007 at 11:54 UTC

    Ok, first, what happens when something you want filtered is split across a file boundary? IMHO (based on the fact that I'm shooting in the dark because you did not post any relevant code) it would be much better to:

    1. Set $/=\5242880 (that's about 5MB).
    2. Open the original 2GB file.
    3. Do a single read, something like my $data = <BIGFILE>; (this would likely also be part of a while loop because of step 7).
    4. Process the data.
    5. seek back xKB (where x is typically 1 to 2KB; this prevents you from missing something which straddles the read boundary).
    6. Read in another chunk.
    7. Repeat from step 4 until the entire file is processed.
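The steps above could be sketched roughly as follows. This is one possible reading of them, not code from the thread: process_in_chunks, its parameters, and the callback interface are all assumed names, and the loop stops at end of file so that the seek back cannot repeat the last chunk forever.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);

# Call $callback->(\$chunk) for successive chunks of $path. Each chunk
# starts $overlap bytes before the previous one ended, so anything shorter
# than $overlap that straddles a read boundary is seen whole at least once.
sub process_in_chunks {
    my ($path, $chunk_size, $overlap, $callback) = @_;
    die "overlap must be smaller than chunk size\n"
        if $overlap >= $chunk_size;

    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/ = \$chunk_size;                # step 1: fixed-size reads
    while (defined(my $data = <$fh>)) {     # steps 3 and 6: read a chunk
        $callback->(\$data);                # step 4: process it
        last if eof($fh);                   # nothing left: stop here
        seek($fh, -$overlap, SEEK_CUR)      # step 5: back up a little
            or die "seek failed: $!";
    }
    close $fh;
    return;
}

# Example: 5MB chunks with a 2KB overlap, counting chunks only.
# my $n = 0;
# process_in_chunks('dump.sql', 5 * 1024 * 1024, 2048, sub { $n++ });
```

Because of the overlap, the last few bytes of each chunk are handed to the callback twice. That is harmless when only searching, but a callback that writes filtered output would have to skip the repeated region itself.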

    As far as your comment about the process "taking considerable time", we (the Monastery) cannot advise you on whether you are doing the right thing or not unless we see actual code.

Re: Split 2GB file and parsing
by clueless newbie (Curate) on Nov 01, 2007 at 17:11 UTC
    May a clueless newbie point out that he's reading a dump of a database? Perhaps it might be better to process the data one database record at a time?

      Which might mean one line at a time or virtually anything else. In what format is the dump? Tab separated? CSV? What about newlines in the data? Or is it XML? Or? Or? "Database dump" doesn't really say much more than "file". We are shooting in the dark.
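If the dump does turn out to have a fixed record terminator, Perl can read one record at a time by setting $/ to that terminator. The sketch below assumes, purely for illustration, that each record ends with a semicolon followed by a newline; the dump format was never stated in the thread, and each_record is a made-up helper name.

```perl
use strict;
use warnings;

# Call $callback->($record) once per record in $path, where records are
# delimited by $separator. The separator is an assumption about the dump.
sub each_record {
    my ($path, $separator, $callback) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/ = $separator;              # readline now returns whole records
    while (defined(my $record = <$fh>)) {
        chomp $record;                  # strips the trailing $separator
        $callback->($record);
    }
    close $fh;
    return;
}

# Example: treat ";\n" as the end of each record.
# each_record('dump.sql', ";\n", sub { print length($_[0]), "\n" });
```

This breaks down if the terminator can also appear inside the data, which is exactly the "newlines in the data" concern raised above; CSV or XML dumps would need a proper parser instead.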
