Another way to remove duplicates is to use the command-line sort utility. It does not need to hold the entire file in memory, so it can sort a HUGE file. Sorting puts identical lines next to each other, so you can then make a single pass through the sorted file and skip any line that matches the immediately preceding line.
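Here is a rough sketch of that two-step approach in Python. The function names (`sort_file`, `drop_adjacent_duplicates`, `dedupe_file`) and the file path parameters are made up for illustration, and it assumes a `sort` command is available on your PATH.

```python
import subprocess
from typing import Iterable, Iterator


def sort_file(src: str, dst: str) -> None:
    """Sort src into dst using the external sort command (assumed to be on PATH)."""
    # The external sort does the work outside Python, so the file
    # never has to be loaded into this process's memory.
    subprocess.run(["sort", "-o", dst, src], check=True)


def drop_adjacent_duplicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line unless it equals the line immediately before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def dedupe_file(src: str, tmp: str, dst: str) -> None:
    """Hypothetical driver: sort src into tmp, then write unique lines to dst."""
    sort_file(src, tmp)
    # Iterating over the file object reads one line at a time,
    # so only the current and previous lines are held in memory.
    with open(tmp) as fin, open(dst, "w") as fout:
        fout.writelines(drop_adjacent_duplicates(fin))
```

Keep in mind the result comes out in sorted order, not the original order. If all you need is the deduplicated file, `sort -u` (or piping `sort` into `uniq`) does both steps for you; the loop above is mainly useful when you want to do extra work per line during the pass.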