in reply to Working on huge (GB sized) files

Your post is pretty light on specifics, but perhaps this will work: process each file and emit an intermediate file where each record is a single line, with the "common field" replicated at the beginning of that line. Do that to both files. Then concatenate the two intermediate files into one with cat, and run the system sort on that combined file from the command line. Now all records that share the same "common field" are adjacent. Process that sorted file to do whatever you want.
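If it helps, here's a minimal sketch of that flattening step in Perl. It assumes blank-line-separated records and that the common field appears on a line like "ID: <value>"; both of those are assumptions, since your post doesn't say what a record actually looks like.

    use strict;
    use warnings;

    $/ = "";                          # paragraph mode: one blank-line-separated record per read
    while ( my $record = <> ) {
        chomp $record;
        my ($key) = $record =~ /^ID:\s*(\S+)/m;   # pull out the "common field" (assumed format)
        next unless defined $key;
        $record =~ s/\n/\x01/g;       # flatten a multi-line record onto a single line
        print "$key\t$record\n";      # common field first, whole record after it
    }

Run it once per input file (e.g. perl flatten.pl fileA > fileA.flat), then cat fileA.flat fileB.flat | sort > merged.txt. The \x01 byte is just an arbitrary separator that is unlikely to occur in real data; pick whatever is safe for yours.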

If the input files are CSV, the right options to the sort command will let you sort on an arbitrary field.
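For example, with GNU sort, something like this sorts a comma-separated file on its first field (note that a plain -t, split breaks down if any CSV field contains an embedded comma or quotes):

    sort -t, -k1,1 merged.csv > merged.sorted.csv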

The system's sort command doesn't need to hold all of the data in memory at once; it will create temp files and do whatever it needs to do to sort this huge file, and it can be faster than you might imagine. Your code only needs to deal with a small number of input lines at a time. Let the system sort handle the job of getting related records adjacent in the file.
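A sketch of that last pass: walk the sorted file and collect each run of adjacent lines that share a key, then hand the group off to whatever processing you need (process_group below is just a hypothetical placeholder, not anything from your post):

    use strict;
    use warnings;

    my ( $current_key, @group );
    while ( my $line = <> ) {
        chomp $line;
        my ( $key, $rest ) = split /\t/, $line, 2;
        if ( defined $current_key && $key ne $current_key ) {
            process_group( $current_key, @group );   # previous run of records is complete
            @group = ();
        }
        $current_key = $key;
        push @group, $rest;
    }
    process_group( $current_key, @group ) if @group; # flush the final run

    sub process_group {
        my ( $key, @records ) = @_;
        # e.g. match up the records that came from file A and file B here
        print "key=$key has ", scalar(@records), " record(s)\n";
    }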

Re^2: Working on huge (GB sized) files
by vasavi (Initiate) on May 15, 2011 at 10:38 UTC
    The approach you mentioned (putting the common field at the start, concatenating the files, and sorting) has worked for my requirement. Thank you!!!