in reply to Using text files to remove duplicates in a web crawler
Instead of slurping the entire %visited hash into memory at once with Storable, try using a lightweight database instead. You can use DBI and DBD::SQLite, for example. The point is that you probably need to give up on keeping the list of all sites you've visited in memory, so you need a solution that gives you quick random access to data kept in more permanent storage, such as a hard drive. Flat files don't scale well, so a quick-and-dirty database would be ideal.
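Something like this is all the setup you'd need (untested sketch; the file name visited.db and the table name visited are just examples, adjust to taste):

use strict;
use warnings;
use DBI;

# Connect to (or create) the on-disk database file.
my $dbh = DBI->connect( "dbi:SQLite:dbname=visited.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# One-time setup: one row per URL already crawled.  The PRIMARY KEY
# gives you an index for fast lookups and rejects duplicates for free.
$dbh->do("CREATE TABLE visited ( url TEXT PRIMARY KEY )");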
Each time you visit a site, plop its URL into the database. And each time you wish to consider following another link to a site, check to see if it's already in the database or not.
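The visit/check logic might then look something like this (again untested, and using the table from the sketch above):

# Record a URL we've just visited.  INSERT OR IGNORE is SQLite's way
# of silently skipping rows that would violate the PRIMARY KEY, so
# visiting the same page twice does no harm.
sub mark_visited {
    my ($dbh, $url) = @_;
    $dbh->do( "INSERT OR IGNORE INTO visited (url) VALUES (?)",
        undef, $url );
}

# True if we've already been to this URL.
sub already_visited {
    my ($dbh, $url) = @_;
    my ($count) = $dbh->selectrow_array(
        "SELECT COUNT(*) FROM visited WHERE url = ?", undef, $url );
    return $count;
}

Then the crawl loop just does next if already_visited($dbh, $link); and calls mark_visited($dbh, $url) after fetching a page.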
Dave
Replies are listed 'Best First'.

Re^2: Using text files to remove duplicates in a web crawler
by matija (Priest) on Jul 07, 2004 at 06:28 UTC

Re^2: Using text files to remove duplicates in a web crawler
by Scarborough (Hermit) on Jul 07, 2004 at 15:51 UTC