I think the Digest method is ideal if you have the memory for it. If 16 bytes per line still consumes too much memory, you could take a piecewise approach.
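For reference, here is a rough sketch of what the digest method could look like. It is written in Python purely for illustration and assumes the digest is MD5 (which is where a 16-byte figure would come from); the function name `find_duplicate_lines` and the choice to write every repeated occurrence are my own assumptions, not something fixed by the thread.

```python
import hashlib


def find_duplicate_lines(in_path, out_path):
    """Write every repeated occurrence of a line in in_path to out_path.

    Only a 16-byte MD5 digest is remembered per distinct line, not the
    line itself. Returns the number of duplicate lines written.
    """
    seen = set()  # digests of the lines encountered so far
    dup_count = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            line = line.rstrip(b"\r\n")
            digest = hashlib.md5(line).digest()  # always 16 bytes
            if digest in seen:
                # Same digest as an earlier line: treat it as a duplicate.
                dst.write(line + b"\n")
                dup_count += 1
            else:
                seen.add(digest)
    return dup_count
```

Keep in mind that the language runtime adds its own per-entry overhead, so real memory use will be well above 16 bytes per line.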

Divide the available memory by the number of lines in the file (if you don't want to count the lines, you can estimate the count as file size divided by average line length, though that costs extra computation).
That should give you the approximate amount of memory you have per line to work with (roughly...you'll probably need to adjust for overhead).
If that number is 16 bytes or greater, just use the digest method. If not, do multiple passes of piecewise duplicate checking.
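The arithmetic above can be sketched as follows. This is only an illustration in Python: `choose_strategy` and the `overhead_per_line` parameter are hypothetical names I made up, and how much overhead to subtract is something you would have to measure for your own setup.

```python
DIGEST_BYTES = 16  # per-line cost of the digest method quoted above


def bytes_per_line(available_bytes, line_count, overhead_per_line=0):
    """Approximate memory budget per line, minus an assumed overhead."""
    if line_count <= 0:
        raise ValueError("line_count must be positive")
    return available_bytes // line_count - overhead_per_line


def choose_strategy(available_bytes, line_count, overhead_per_line=0):
    """Return ("digest", 16) or ("piecewise", prefix_length)."""
    budget = bytes_per_line(available_bytes, line_count, overhead_per_line)
    if budget >= DIGEST_BYTES:
        return ("digest", DIGEST_BYTES)
    if budget < 1:
        raise ValueError("not enough memory for even one byte per line")
    # Too little memory for a full digest: compare only a short prefix.
    return ("piecewise", budget)
```

For example, `choose_strategy(6_000_000, 1_000_000)` gives `("piecewise", 6)`, which is the 6-bytes-per-line situation.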

For instance, if it turns out you only have 6 bytes of memory available per line, do one run over the data in which you treat only the first 6 characters of each line as the line (i.e., store the first 6 characters of the line in the hash, and use that to check for duplicates against the first 6 characters of every other line in the file).
Use that to create a new file.
That should produce a smaller file containing only the lines whose first 6 characters also appear at the start of some other line.
Since no line that has a true duplicate can have been dropped, you now have a smaller (but still lossless) file to work with, and you can run another sweep on the new file checking a larger number of characters.
Repeat until the number of characters checked covers whole lines, at which point the result is exact.
It's ugly and disk-expensive, but if you really don't have the memory available, it may be the only way to accomplish the task.
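A rough sketch of the piecewise idea, again in Python for illustration only. The function names, the `work_dir` argument, and the rule of doubling the prefix length each round are my own assumptions (in practice you would recompute the budget from the size of the shrunken file). Each round reads its input twice: once to count prefixes, once to copy the lines whose prefix occurs more than once.

```python
import os
import shutil


def filter_by_prefix(in_path, out_path, prefix_len):
    """Copy lines whose first prefix_len bytes occur on more than one line.

    Returns (lines_kept, longest_kept_line_length).
    """
    counts = {}  # prefix -> number of lines starting with it
    with open(in_path, "rb") as src:
        for line in src:
            key = line.rstrip(b"\r\n")[:prefix_len]
            counts[key] = counts.get(key, 0) + 1
    kept = longest = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            line = line.rstrip(b"\r\n")
            if counts[line[:prefix_len]] > 1:
                dst.write(line + b"\n")
                kept += 1
                longest = max(longest, len(line))
    return kept, longest


def piecewise_duplicates(in_path, out_path, prefix_len, work_dir):
    """Repeat the prefix filter with longer prefixes until the result is exact."""
    current, round_no = in_path, 0
    while True:
        round_no += 1
        nxt = os.path.join(work_dir, f"pass{round_no}.txt")
        kept, longest = filter_by_prefix(current, nxt, prefix_len)
        if current != in_path:
            os.remove(current)  # drop the previous intermediate file
        current = nxt
        # Once the prefix covers every surviving line, only true dups remain.
        if kept == 0 or prefix_len >= longest:
            break
        prefix_len *= 2  # assumption: the smaller file leaves room for more
    shutil.move(current, out_path)
    return kept
```

Note that this version ends up with every occurrence of each duplicated line (including the first one) in the output file, in the original order.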

In reply to Re^2: Find duplicate lines from the file and write it into new file. by wojtyk
in thread Find duplicate lines from the file and write it into new file. by anna_here
