in reply to 15 billion row text file and row deletes - Best Practice?

How many serials are in the "file that has a list of serials to delete"?

If it is a relatively small number, you could read them into a hash. Then you could read through the 15 billion line file line-by-line (thereby avoiding the need to keep the whole thing in memory at once); if the line's serial is in the "delete" hash, skip it and read the next line; otherwise print it to a new output file.

You'll need enough drive space to accommodate the new output file, of course, but this accomplishes the goal without using a db, with minimal memory requirements, and with only a single pass through the 15 billion line file.
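
A minimal sketch of that approach (the filenames are placeholders, and it assumes each line of the master file is just the serial; adjust the parsing if the serial is one field among several):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load the (relatively small) delete list into a hash.
    open my $del, '<', 'serials_to_delete.txt' or die "open: $!";
    my %delete;
    while (my $serial = <$del>) {
        chomp $serial;
        $delete{$serial} = 1;
    }
    close $del;

    # Stream the big file one line at a time; never hold it all in memory.
    open my $in,  '<', 'master.txt'   or die "open: $!";
    open my $out, '>', 'filtered.txt' or die "open: $!";
    while (my $line = <$in>) {
        chomp(my $serial = $line);   # chomp a copy; $line keeps its newline
        next if exists $delete{$serial};
        print {$out} $line;
    }
    close $in;
    close $out or die "close: $!";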


Re^2: 15 billion row text file and row deletes - Best Practice?
by friedo (Prior) on Dec 01, 2006 at 05:24 UTC
    If it is a relatively small number, you could read them into a hash.

    If it's a big number, you could read them into a disk-based hash like BerkeleyDB. It's a lot slower than an in-memory hash, of course, but it would make the code pretty easy to write.
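
    A minimal sketch of that, using the DB_File module (one common tied interface to Berkeley DB; the filename is a placeholder). Once %delete is tied, the same line-by-line filter loop works unchanged:

        use strict;
        use warnings;
        use Fcntl;      # O_CREAT, O_RDWR
        use DB_File;    # ties a hash to an on-disk Berkeley DB file

        # %delete lives on disk, so even a very large delete list is fine.
        tie my %delete, 'DB_File', 'delete.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie delete.db: $!";

        open my $del, '<', 'serials_to_delete.txt' or die "open: $!";
        while (my $serial = <$del>) {
            chomp $serial;
            $delete{$serial} = 1;
        }
        close $del;
        untie %delete;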

    If it were me, though, I'd probably use a database.

      That was my first thought, too (well, almost - I thought of DBM::Deep because I'd played with that one before).

      I would have suggested a db if the OP hadn't clearly stated "Without using a DB..." :-) It also depends on how this data is being used/processed, though. If this is a one-time filtering step, then loading it all into a db, filtering, and exporting again could be very inefficient. OTOH, if this is just one of many steps that require searching through the data, a db could be better.

      The 15 billion serials are unique. I'm just nervous about using a DB since I don't know how big the overhead will be. I have 680 GB of free disk space, so maybe I should use one.

        But how many serials do you have to delete? You don't need to hold the whole master file in a hash or database at once if the delete list is small enough to fit into memory. Iterate over the input file one line at a time. For each line, check the delete hash to see whether this is a line you need to eliminate. If it is, next; otherwise, print to your new output file. Move on to the next line... lather, rinse, repeat.

        If practical, hold the delete list in an in-memory hash (one quick way to gauge whether it will fit is sketched below). If it's not practical to do so, hold the delete list in a database. But leave the master list in a flat file.
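
        For the "fits into memory" question, you could measure a sample with the CPAN module Devel::Size and extrapolate. A rough sketch (the 12-digit serial format here is made up):

            use strict;
            use warnings;
            use Devel::Size qw(total_size);

            # Build a sample of 100_000 fake 12-digit serials...
            my %delete;
            $delete{ sprintf '%012d', $_ } = 1 for 1 .. 100_000;

            # ...then scale linearly to estimate the full delete list.
            printf "sample uses %.1f MB\n", total_size(\%delete) / (1024 * 1024);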


        Dave