in reply to Re: Find duplicate lines from the file and write it into new file.
in thread Find duplicate lines from the file and write it into new file.
Divide the available memory by the number of lines in the file (or, if you want to spend the time on the computation, estimate the line count from the file size and the average line length).
That should give you the approximate amount of memory you have to work with per line (roughly; you'll probably need to adjust for overhead).
If that number is 16 bytes or greater, just use the digest method. If not, do multiple passes of piecewise duplicate checking.
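For the first case, the digest method boils down to something like this. A minimal sketch, assuming Digest::MD5 and made-up file names; the 16 bytes cover only the digest itself, so Perl's per-key hash overhead is the "adjust for overhead" part above:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);   # md5() returns a 16-byte binary digest

my %seen;
open my $in,  '<', 'input.txt' or die "Can't read input.txt: $!";
open my $out, '>', 'dups.txt'  or die "Can't write dups.txt: $!";
while ( my $line = <$in> ) {
    my $key = md5($line);                 # fixed 16 bytes per line, however long the line is
    print {$out} $line if $seen{$key}++;  # second and later sightings are duplicates
}
close $in;
close $out;
```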
For instance, if it turns out you only have 6 bytes of memory available per line, do one pass over the data where you treat only the first 6 characters of each line as the line (i.e., store the first 6 characters in the hash, and use that to check for dups against the first 6 characters of every other line in the file).
Use that to create a new file.
That should produce a smaller file containing only the lines that have duplicates within their first 6 characters.
Since you now have a smaller file to work with (one that still contains every real duplicate), you can then run another sweep on the new file checking a larger number of characters.
Repeat, widening the prefix each pass, until the candidate file is small enough to check full lines (or their digests) in memory.
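A single sweep might look like this. Just a sketch, with a hypothetical 6-character prefix and placeholder file names; it reads the input twice so the candidate file keeps every line that shares a duplicated prefix, not only the repeats:

```perl
use strict;
use warnings;

my $width = 6;    # hypothetical per-line memory budget, in characters
my %prefix_count;

# Pass 1: count how often each prefix occurs.
open my $in, '<', 'input.txt' or die "Can't read input.txt: $!";
while ( my $line = <$in> ) {
    $prefix_count{ substr $line, 0, $width }++;
}
close $in;

# Pass 2: keep only lines whose prefix occurs more than once;
# only these can possibly be full-line duplicates.
open $in, '<', 'input.txt' or die "Can't read input.txt: $!";
open my $out, '>', 'candidates.txt' or die "Can't write candidates.txt: $!";
while ( my $line = <$in> ) {
    print {$out} $line if $prefix_count{ substr $line, 0, $width } > 1;
}
close $in;
close $out;
```

You'd then rerun the same sweep on candidates.txt with a wider prefix, or switch to the digest method once the candidate file fits in memory.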
It's ugly and disk-expensive, but if you really don't have the memory available, it may be the only way to accomplish the task.