in reply to 15 billion row text file and row deletes - Best Practice?

Another strategy that no one's mentioned is to split the incoming file into smaller chunks and work on each chunk.
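A minimal sketch of that splitting step in Perl (the input file name and chunk size here are just placeholder assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only -- the file name and chunk size are assumptions.
    my $big_file   = 'big_file.txt';
    my $chunk_size = 1_000_000;      # lines per chunk

    open my $in, '<', $big_file or die "Can't open $big_file: $!";

    my $chunk_num = 0;
    my $out;
    while ( my $line = <$in> ) {
        # Start a new chunk file every $chunk_size lines.
        if ( ( $. - 1 ) % $chunk_size == 0 ) {
            close $out if $out;
            open $out, '>', sprintf( 'chunk_%04d.txt', $chunk_num++ )
                or die "Can't open chunk file: $!";
        }
        print {$out} $line;
    }
    close $out if $out;
    close $in;

On most Unix systems, split -l 1000000 big_file.txt chunk_ does the same job without any Perl at all.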

But it looks like you have plenty of approaches from which to choose -- although it would have been nice to know how large your kill file is, to put the problem into better perspective.

Alex / talexb / Toronto

"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds


Re^2: 15 billion row text file and row deletes - Best Practice?
by tubaandy (Deacon) on Dec 01, 2006 at 15:42 UTC
    Alex has a good point: you could put together a script to grab chunks of the big original file (say, 1 million lines at a time) and write each chunk to a temp file. Then follow the method where you read the deletes into a hash, parse through the temp file, and append the good lines to the final file. This way you'll have four files at any one time: the original file, the delete file, the chunk temp file, and the final file. Then again, you'll be butting up against your disk space limit...
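    A rough sketch of that workflow, assuming the kill file holds one exact row to delete per line (all the file names and the chunk size below are placeholder assumptions):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Sketch only -- file names and chunk size are assumptions, and the
        # kill file is assumed to hold the exact rows to delete, one per line.
        my $big_file   = 'big_file.txt';
        my $kill_file  = 'deletes.txt';
        my $temp_file  = 'chunk.tmp';
        my $final_file = 'final.txt';
        my $chunk_size = 1_000_000;

        # Read the deletes into a hash for fast lookups.
        my %kill;
        open my $kf, '<', $kill_file or die "Can't open $kill_file: $!";
        while (<$kf>) { chomp; $kill{$_} = 1 }
        close $kf;

        open my $in,  '<',  $big_file   or die "Can't open $big_file: $!";
        open my $out, '>>', $final_file or die "Can't open $final_file: $!";

        while ( !eof($in) ) {

            # Copy the next chunk of the original into the temp file.
            open my $tmp, '>', $temp_file or die "Can't open $temp_file: $!";
            for ( 1 .. $chunk_size ) {
                last if eof($in);
                print {$tmp} scalar <$in>;
            }
            close $tmp;

            # Parse through the temp file, appending only the good lines.
            open $tmp, '<', $temp_file or die "Can't open $temp_file: $!";
            while ( my $line = <$tmp> ) {
                chomp( my $key = $line );
                print {$out} $line unless exists $kill{$key};
            }
            close $tmp;
        }

        close $in;
        close $out;
        unlink $temp_file;

    Only the original, the kill file, one temp chunk, and the growing final file exist at any moment, which is the four-file situation described above.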

    Anyway, just a thought.

    tubaandy