in reply to question about algorithm
Also see part - split up files according to column value, which uses a hash. It uses a hash only because it cannot assume that the input file is sorted on the column in question; with sorted input, each value's lines arrive together, so there is no need to track every value at once.
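
To make the difference concrete, here is a minimal sketch in Python. It is not the code of part; the function names `split_unsorted` and `split_sorted` are made up for illustration, and it collects lines in memory where a real tool would presumably write them to per-value output files.

```python
import itertools

def split_unsorted(lines, col, sep=None):
    """Group lines by the value in column `col`; input may be in any order."""
    buckets = {}  # column value -> lines seen so far for that value
    for line in lines:
        key = line.split(sep)[col]
        buckets.setdefault(key, []).append(line)
    return buckets

def split_sorted(lines, col, sep=None):
    """Yield (value, lines) one group at a time.

    Assumes the input is already sorted on column `col`, so all lines
    for a given value are adjacent and no hash of every value is needed.
    """
    def key_of(line):
        return line.split(sep)[col]
    for key, group in itertools.groupby(lines, key=key_of):
        yield key, list(group)
```

With unsorted input, the hash has to keep an entry for every distinct value until the end of the file. With sorted input, only the current group is held at any time, which is why the hash becomes unnecessary.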