in reply to Using text files to remove duplicates in a web crawler
Instead of slurping the entire %visited hash into memory at once with Storable, try using a lightweight database instead. You can use DBI and DBD::SQLite, for example. The point is that you probably need to give up on keeping the list of all sites you've visited in memory, so you need a solution that gives you quick random access to data kept in more permanent storage, such as a hard drive. Flat files don't scale well, so a quick-and-dirty database would be ideal.
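Something like this is all the setup you'd need (untested sketch; the file name visited.db and the table name visited are just examples, adjust to taste):

use strict;
use warnings;
use DBI;

# Connect to (or create) the on-disk database file.
my $dbh = DBI->connect( "dbi:SQLite:dbname=visited.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# One-time setup: one row per URL already crawled.  The PRIMARY KEY
# gives you an index for fast lookups and rejects duplicates for free.
$dbh->do("CREATE TABLE visited ( url TEXT PRIMARY KEY )");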
Each time you visit a site, plop its URL into the database. And each time you wish to consider following another link to a site, check to see if it's already in the database or not.
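The visit/check logic might then look something like this (again untested, and using the table from the sketch above):

# Record a URL we've just visited.  INSERT OR IGNORE is SQLite's way
# of silently skipping rows that would violate the PRIMARY KEY, so
# visiting the same page twice does no harm.
sub mark_visited {
    my ($dbh, $url) = @_;
    $dbh->do( "INSERT OR IGNORE INTO visited (url) VALUES (?)",
        undef, $url );
}

# True if we've already been to this URL.
sub already_visited {
    my ($dbh, $url) = @_;
    my ($count) = $dbh->selectrow_array(
        "SELECT COUNT(*) FROM visited WHERE url = ?", undef, $url );
    return $count;
}

Then the crawl loop just does next if already_visited($dbh, $link); and calls mark_visited($dbh, $url) after fetching a page.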
Dave
Replies are listed 'Best First'.

Re^2: Using text files to remove duplicates in a web crawler
by matija (Priest) on Jul 07, 2004 at 06:28 UTC

Re^2: Using text files to remove duplicates in a web crawler
by Scarborough (Hermit) on Jul 07, 2004 at 15:51 UTC