in reply to fast lookups in files

Since you know that the large file is already sorted, the most efficient processing technique is to sort the other input file(s) by the same key. Then you can process the two streams side by side, sequentially: no “search” is involved.
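
For illustration, here is a minimal sketch of that side-by-side pass in Perl. The file names, the tab-separated key layout, and the action on a match (printing the master record) are all assumptions made for the example:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Both files sorted ascending on the key (first tab-separated field).
    open my $big,   '<', 'master.sorted'  or die "master.sorted: $!";
    open my $small, '<', 'queries.sorted' or die "queries.sorted: $!";

    my $m = <$big>;
    my $q = <$small>;

    while ( defined $m and defined $q ) {
        my ($mkey) = split /\t/, $m, 2;
        my ($qkey) = split /\t/, $q, 2;

        if    ( $mkey lt $qkey ) { $m = <$big>   }  # master behind: advance it
        elsif ( $mkey gt $qkey ) { $q = <$small> }  # no match for this query
        else {                                      # keys equal: a hit
            print $m;
            $q = <$small>;
        }
    }

Each file is read exactly once, front to back, so the cost is dominated by the sorts rather than by any per-record searching.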

If it is at all possible to do this, then this is what you should do. “Inconvenience yourself” to do it this way: you'll be glad you did.

You do not have to worry about updating the original file in place. Write the changes to another file in the same format, sort that file by the same key, then merge the two to produce an updated master file. In each case you're doing the job by means of sorts (which are surprisingly fast) and sequential reads. When you're finished, you'll have the original master file (unchanged), the delta file (now sorted), and the updated master file.
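
A sketch of that update pass, again in Perl. It assumes records keyed on the first tab-separated field, that an external sort(1) is available (a whole-line lexical sort, which orders by the leading key), and that a delta record simply replaces the master record with the same key; all file names are illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sort the delta file into the same key order as the master.
    system( 'sort', '-o', 'delta.sorted', 'delta.txt' ) == 0
        or die "sort failed: $?";

    open my $old, '<', 'master.sorted'  or die $!;
    open my $dlt, '<', 'delta.sorted'   or die $!;
    open my $new, '>', 'master.updated' or die $!;

    my $o = <$old>;
    my $d = <$dlt>;

    while ( defined $o or defined $d ) {
        my ($okey) = defined $o ? split( /\t/, $o, 2 ) : ();
        my ($dkey) = defined $d ? split( /\t/, $d, 2 ) : ();

        if ( !defined $d or ( defined $o and $okey lt $dkey ) ) {
            print {$new} $o;  $o = <$old>;   # master record unchanged
        }
        elsif ( !defined $o or $okey gt $dkey ) {
            print {$new} $d;  $d = <$dlt>;   # new record from the delta
        }
        else {
            print {$new} $d;                 # delta replaces master record
            $o = <$old>;  $d = <$dlt>;
        }
    }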

Yes, that is exactly how data processing was done, using punched cards, long before digital computers were invented... And it worked.

Failing that, an appropriate strategy would be to use something like DB_File (e.g. a Berkeley DB) ... but beware: random seeks are time-consuming in large quantities.
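
If you do take that route, a tied hash keeps the Perl side simple. A minimal sketch with DB_File, using a B-tree file; the database name and key are made up for the example, and the one-time load of the big file into lookup.db is not shown:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw( O_RDONLY );

    # Tie a hash to an existing Berkeley DB B-tree file, read-only.
    tie my %db, 'DB_File', 'lookup.db', O_RDONLY, 0644, $DB_BTREE
        or die "Cannot tie lookup.db: $!";

    my $key = 'some-key';    # illustrative
    print exists $db{$key} ? $db{$key} : 'not found', "\n";

    untie %db;

Each fetch is a disk-backed B-tree probe, which is fine for occasional lookups but, as noted above, adds up quickly if you do millions of them.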
