Instead of slurping the entire %visited hash into memory at once with Storable, try a lightweight database. You can use DBI with DBD::SQLite, for example. The point is that you probably need to give up on keeping the list of all the sites you've visited in memory, and instead come up with a solution that gives you quick random access to data held in more permanent storage, such as a hard drive. Flat files don't scale well, so a quick-and-dirty database would be ideal.
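Something along these lines would get you started (untested, and the database file name, table name, and column name are only placeholders):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Open (or create) a SQLite file and a table to hold visited URLs.
    # Making the URL the primary key keeps lookups fast and rejects dupes.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=visited.db", "", "",
        { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS visited (
            url TEXT PRIMARY KEY
        )
    });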
Each time you visit a site, plop its URL into the database. And each time you're considering following another link, check whether that URL is already in the database.
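The insert/check cycle might look roughly like this (again untested; the subroutine names are made up for illustration):

    # Prepared statements for marking a URL visited and checking for it.
    # "INSERT OR IGNORE" is SQLite's way of skipping rows already present.
    my $insert = $dbh->prepare("INSERT OR IGNORE INTO visited (url) VALUES (?)");
    my $lookup = $dbh->prepare("SELECT 1 FROM visited WHERE url = ?");

    sub mark_visited {
        my ($url) = @_;
        $insert->execute($url);
    }

    sub already_visited {
        my ($url) = @_;
        $lookup->execute($url);
        my ($found) = $lookup->fetchrow_array;
        $lookup->finish;
        return defined $found;
    }

    # In the crawler loop:
    #   next if already_visited($link);
    #   ... fetch and parse the page ...
    #   mark_visited($link);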
Dave