in reply to Re^2: 15 billion row text file and row deletes - Best Practice?
in thread 15 billion row text file and row deletes - Best Practice?

The 15 billion serials are unique. I'm just nervous about using a hash, since I don't know how big the overhead will be. I have 680GB free on disk, so maybe I should use a DB.

Re^4: 15 billion row text file and row deletes - Best Practice?
by davido (Cardinal) on Dec 01, 2006 at 06:03 UTC

    But how many serials do you have to delete? You don't need to hold the master file in a hash or database all at once if the delete list is small enough to fit into memory. Iterate over the input file one line at a time. For each line, check the delete hash to see whether this is a line you need to eliminate. If it is, next; otherwise, print it to your new output file. Move on to the next line... lather, rinse, repeat.

    If practical, hold the delete list in an in-memory hash. If it's not practical to do so, hold the delete list in a database. But leave the master list in a flat file.
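
    A minimal sketch of that approach in Perl, for reference. The file names (deletes.txt, master.txt, master.new) and the assumption that the serial is the first comma-separated field on each row are illustrative, not from the thread:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Load the (small) delete list into a hash, one serial per line.
        my %delete;
        open my $del, '<', 'deletes.txt' or die "Can't open deletes.txt: $!";
        while (my $serial = <$del>) {
            chomp $serial;
            $delete{$serial} = 1;
        }
        close $del;

        # Stream the master file and copy every line whose serial
        # (assumed here to be the first comma-separated field) is not
        # in the delete hash.
        open my $in,  '<', 'master.txt' or die "Can't open master.txt: $!";
        open my $out, '>', 'master.new' or die "Can't open master.new: $!";
        while (my $line = <$in>) {
            my ($serial) = split /,/, $line, 2;
            next if exists $delete{$serial};
            print {$out} $line;
        }
        close $in;
        close $out or die "Can't close master.new: $!";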


    Dave

      And if the list of items to delete is _really_ small, I just use egrep from the command line:

      egrep -v '^(item1|item2|item3|item4|item5),' input_file > output_file
        Or use the -f option to GNU grep:
        grep -E -v -f deletes.txt infile > outfile