in reply to finding and deleting repeated lines in a file
If order isn't important, you can sort the file and then remove consecutive repeated lines with a short script. This saves memory as long as your sort can work within the available memory limits.
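As a minimal sketch of that approach (assuming a hypothetical input file `input.txt`), the standard `sort` and `uniq` tools do exactly this, since `uniq` only removes *consecutive* duplicates:

```shell
# Sort first, so duplicate lines become adjacent,
# then drop consecutive repeats with uniq.
sort input.txt | uniq > deduped.txt

# GNU/BSD sort can also do both steps in one pass:
sort -u input.txt > deduped.txt
```

An external-merge `sort` implementation will spill to temporary files on disk, which is why this route works even when the file doesn't fit in memory.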
If order is important and the lines are quite large, you can use Digest::MD5 to compute a checksum of each line, then keep the checksums seen so far and compare each new line's checksum against them instead of comparing the full lines. Since each MD5 digest is only 16 bytes, this saves a good deal of memory when lines are long.
I risk repeating what's already been said, but I think the previous posts were dancing around the issue.