In reply to "Are two lines in the text file equal"
With a file containing "millions" of lines, your biggest problem is memory consumption.
One million lines of 80 characters comes to 80 MB if you load the file as a single scalar; put it into an array and that rises to something like 104 MB. To detect duplicates, you really need to use a hash. With keys of 80 characters, a quick experiment shows that you will require approximately 160 bytes per line, or 150 MB for a file of one million lines. And that is storing undef as the value in the hash, when you probably want to be storing either the line number or, preferably, the byte offset of the start of the line.
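A minimal sketch of that hash approach, with the function and variable names my own: key on the line itself and store the byte offset of its first occurrence, reporting any repeats.

```perl
use strict;
use warnings;

# Sketch: %seen maps each line to the byte offset of its first
# occurrence; repeats are returned as [dup_offset, first_offset] pairs.
sub find_dups {
    my ($fh) = @_;
    my ( %seen, @dups );
    while (1) {
        my $offset = tell $fh;         # offset BEFORE reading the line
        my $line   = <$fh>;
        last unless defined $line;
        chomp $line;
        if ( exists $seen{$line} ) {
            push @dups, [ $offset, $seen{$line} ];
        }
        else {
            $seen{$line} = $offset;
        }
    }
    return @dups;
}
```

Storing the offset rather than the line number means you can `seek` straight back to the duplicate later without re-counting lines.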
If your lines are longer than 80 characters, or your file contains many millions of lines, the memory consumption is likely to exceed the capacity of your machine, so you need to look for alternatives. One possibility is to create a digest of each line. A binary MD5 digest gives you 16-byte keys rather than 80, which sounds as though it should let you handle a file roughly 5x bigger for a given amount of memory. Unfortunately, it doesn't: 16-byte keys still require approximately 120 bytes per line, or 115 MB. Again, that is with undef for the values, which isn't useful, as it would only tell you that you had duplicates and their MD5s, not where in the file they were, and it would require another pass regenerating the MD5 of each line until you found the duplicates. The MD5 algorithm itself isn't that quick either. Nor is there any guarantee that every distinct line will generate a unique MD5, so you would need to retain the line numbers or byte offsets of every line that produced each MD5 so that you can go back and verify that they really are duplicates. That again increases the memory usage and reduces the size of file you can handle.
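The digest variant could be sketched like this (names are mine): key the hash on the 16-byte binary MD5 and keep every offset that produced it, so that candidate duplicates can be re-checked byte-for-byte against the file afterwards.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);    # md5() returns the 16-byte binary digest

# Sketch: group byte offsets by the binary MD5 of the line. Any group
# with more than one offset is a *candidate* set of duplicates; a
# collision is still possible, so the lines at those offsets must be
# compared directly before declaring them equal.
sub digest_candidates {
    my ($fh) = @_;
    my %by_md5;
    while (1) {
        my $offset = tell $fh;
        my $line   = <$fh>;
        last unless defined $line;
        chomp $line;
        push @{ $by_md5{ md5($line) } }, $offset;
    }
    return grep { @$_ > 1 } values %by_md5;
}
```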
Ultimately, unless you have huge amounts of memory, you are probably better off sorting the file, processing the sorted output and comparing consecutive lines to find the duplicates, and then re-processing the original file to recover their original context if that is a requirement.
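The consecutive-line comparison on the sorted output needs only one remembered line of state, something like this sketch (names mine):

```perl
use strict;
use warnings;

# Sketch: in sorted input, equal lines are adjacent, so one pass
# remembering just the previous line finds every duplicated value.
sub dups_in_sorted {
    my ($fh) = @_;
    my ( $prev, %dups );
    while ( my $line = <$fh> ) {
        chomp $line;
        $dups{$line} = 1 if defined $prev and $line eq $prev;
        $prev = $line;
    }
    return sort keys %dups;
}
```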
If you have a system sort facility that can handle the volume of data involved, use that; if you also have the uniq command, it will find the dups for you. This will almost certainly be quicker than anything you write in perl.
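With POSIX sort and uniq that whole pipeline is one line; the filename and sample data here are just placeholders:

```shell
# Placeholder input file for the example
printf 'foo\nbar\nfoo\nbaz\n' > big.txt
# sort brings equal lines together; uniq -d prints each duplicated line once
sort big.txt | uniq -d
```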
However, if you don't have a sort facility that can handle the volume, you might try this split/sort/merge code I wrote a while back. It is fairly simplistic and could certainly be improved performance-wise, but it has always run well enough that I've never needed to. YMMV :)
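The attached code did not survive in this copy of the post; this is not the author's version, just a minimal split/sort/merge sketch of the same idea, with the chunk size and all names my own:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sketch of an external sort: split the input into chunks that fit in
# memory, sort each chunk and spill it to a temp file, then merge the
# spill files by repeatedly emitting the smallest head line.
sub external_sort {
    my ( $in, $out, $chunk_lines ) = @_;
    $chunk_lines ||= 100_000;

    # Split phase: sort each chunk in memory and write it to disk.
    my ( @spills, @buf );
    my $flush = sub {
        return unless @buf;
        my ( $tfh, $tname ) = tempfile( UNLINK => 1 );
        print {$tfh} sort @buf;
        close $tfh;
        push @spills, $tname;
        @buf = ();
    };
    while ( my $line = <$in> ) {
        push @buf, $line;
        $flush->() if @buf >= $chunk_lines;
    }
    $flush->();

    # Merge phase: pick the lexically smallest head among the spills.
    my @fhs   = map { open my $fh, '<', $_ or die "$_: $!"; $fh } @spills;
    my @heads = map { scalar <$_> } @fhs;
    while (1) {
        my $min;
        for my $i ( 0 .. $#heads ) {
            next unless defined $heads[$i];
            $min = $i if !defined $min or $heads[$i] lt $heads[$min];
        }
        last unless defined $min;
        print {$out} $heads[$min];
        $heads[$min] = readline $fhs[$min];
    }
}
```

A real implementation would merge spill files in batches if their number exceeded the open-file limit, but the shape of the algorithm is the same.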