I don't really like having to depend on the files being sorted. One alternative way to remove duplicate data is to use a hash to temporarily hold your data. You can read in the data from the files, place it in a hash, and then (eventually) write it back out again. Since hash keys are unique, any later duplicate row simply overwrites the earlier one, and you end up with only one copy of each unique element.
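
As a minimal sketch of that idea, assuming the whole line serves as the duplicate-detection key, the data fits comfortably in memory, and the file names come in on the command line (the hash name %seen is just an arbitrary choice):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;    # each unique line becomes a key; duplicates collapse

    # Read every line from the files named on the command line.
    while ( my $line = <> ) {
        chomp $line;
        $seen{$line} = 1;    # a later duplicate overwrites this entry
    }

    # Write the unique lines back out. Hash keys come out in no
    # particular order, so sort them if you need a stable ordering.
    print "$_\n" for keys %seen;

Note that this trades the sorted-input requirement for memory: the hash has to hold one copy of every unique line at once.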