in reply to Another "out of memory!" problem
Hi all, thank you for your interesting replies. To answer a couple of questions: this is for an internal site belonging to my employer, really several large collections of information (Eclipse instances). We're trying to catalog every page in every Eclipse instance, partly so that we can see which pages get zero hits. The left nav has a TOC tree that the script can traverse, but not all pages are in the TOC, so the script opens each page, scans for any hrefs not in the TOC, follows them down recursively, and adds them to the list. It's interesting to hear that 27,000 URLs is not really that big, so I suspect something else is going on.
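For what it's worth, the logic is roughly this (sketched here in Python rather than the actual script; the starting URL and the "stay inside this instance" prefix check are placeholders, and I've written it with an explicit stack instead of literal recursion, since a deep recursive walk holds every call frame in memory at once):

    # Rough sketch of the crawl, standard library only.
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urldefrag
    from urllib.request import urlopen

    class HrefCollector(HTMLParser):
        """Collects href attributes from anchor tags."""
        def __init__(self):
            super().__init__()
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.hrefs.append(value)

    def crawl(start_url, seen):
        """Walk every reachable page; `seen` is the already-cataloged hash."""
        stack = [start_url]
        while stack:
            url = stack.pop()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue  # dead link; skip it
            parser = HrefCollector()
            parser.feed(html)
            for href in parser.hrefs:
                # Resolve relative links and drop #fragments so the same
                # page doesn't get cataloged once per anchor.
                absolute, _ = urldefrag(urljoin(url, href))
                if absolute.startswith(start_url):  # crude placeholder check
                    stack.append(absolute)
        return seen

One thing the fragment-stripping line above reminds me to check in my own script: if #fragment or ?query variants aren't being normalized, the "27,000 pages" could actually be far more hash entries than real pages.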
In answer to the question of why I don't use a database: I'm saving the URLs (and some other data from each page) to a CSV file as I go, so other than the URLs themselves, not much stays in memory. I'm not sure how a db would help -- I'd still need the hash to check for already-cataloged URLs, or else I'd have to query the db for each one.
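Though, thinking about it, maybe what people mean by "use a database" is an on-disk hash, which would give the same membership check without holding every URL in RAM or issuing a SQL query per URL. A rough sketch of that idea in Python (the file names and CSV columns here are made up for illustration):

    # Sketch: Python's dbm module is a hash-like, on-disk key store,
    # so the "seen" check doesn't live in memory.
    import csv
    import dbm

    def catalog(url, row, seen_db, csv_writer):
        """Write the row once per URL, using an on-disk dbm as the seen-set."""
        key = url.encode("utf-8")
        if key in seen_db:       # fast lookup, no per-URL db query
            return False
        seen_db[key] = b"1"
        csv_writer.writerow(row)
        return True

    with dbm.open("seen_urls.db", "c") as seen_db, \
         open("catalog.csv", "a", newline="") as fh:
        writer = csv.writer(fh)
        catalog("http://example.test/page1.html",
                ["http://example.test/page1.html", "Page 1 title"],
                seen_db, writer)

Whether that's worth the disk I/O for 27,000 entries is another question; it may only matter if the list is really much bigger than I think.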
Anyway, I'll look through the suggestions and see if I can find something that works for me. Thanks.
Scott