comment on

If you're going to be processing many thousands of URLs (25K seems like it satisfies this condition), you may want to try something called a Bloom Filter, for which a sample module already exists on CPAN (Bloom::Filter).

A Bloom Filter a simplified hashtable that can test only presence/absence (unlike a full-blown hashtable which can associate arbitrary data, such as a scalar, with each entry). They can encode this information in FAR less space than a normal hashtable.

Be warned! Bloom Filters are not perfect; they have a very small, but existent error rate of false positives/negatives that occur. By increasing the amount of storage per entry (decreasing the effective compression), however, the error rate can be reduced to quite reasonable proportions. As this article indicates, you can encode up to 1 *million* URLs and have a 1-in-1-million false positive rate while storing only 6 bytes per key. That's LESS than the 'www.' and '.com' you're almost guaranteed to find in every URL (8 bytes).

--Stevie-O

$"=$,,$_=q>|\p4<6 8p<M/_|<('=>
.q>.<4-KI<l|2$<6%s!<qn#F<>;$,
.=pack'N*',"@{[unpack'C*',$_]
}"for split/</;$_=$,,y[A-Z a-z]
         {}cd;print lc
[download]

In reply to Re: Using text files to remove duplicates in a web crawler by Stevie-O
in thread Using text files to remove duplicates in a web crawler by mkurtis

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.