Re: Verifying Links in HTML

A general question gets a general answer. . .
Are you going to be checking static or dynamic pages? This "infinite loop" thing is probably more of an issue for dynamic pages. Frequently there are multiple ways of looking at the same page (normal, printable, lo-bandwidth, whatever), so you'd need to take this into account.

If you're spidering static pages, what about dumping the address of each page you visit into a DB, and then checking each new page's address against the DB? If the address is already there, skip the page.
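Here's a rough sketch of that "check the DB, skip if present" idea. It assumes an in-memory SQLite database, and the table name `seen` and the helper names are made up for illustration; any DB with a unique key on the address would do the same job.

```python
import sqlite3

def make_seen_db():
    # Hypothetical schema: one row per page address already visited.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seen (url TEXT PRIMARY KEY)")
    return conn

def first_visit(conn, url):
    """Return True the first time url is seen, False on every later call."""
    row = conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return False  # already in the DB, so the spider should skip it
    conn.execute("INSERT INTO seen (url) VALUES (?)", (url,))
    return True
```

The spider would call `first_visit` on every link it pulls out of a page and only fetch the ones that come back true, which is what keeps it from looping.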

If they are dynamic pages, and each document has a unique id of some sort, you could parse out the ID and store that in the DB. Then the multiple views thing would be moot, because the ID is what you're matching on.
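Something like the following could pull the ID out, as a sketch only. It assumes the ID lives in a query-string parameter called `id` and that IDs are unique across the site; your URLs may carry it in the path or under another name, so treat `document_key` and `id_param` as placeholders.

```python
from urllib.parse import urlsplit, parse_qs

def document_key(url, id_param="id"):
    """Return the document ID if the URL has one, else the URL itself."""
    query = urlsplit(url).query
    values = parse_qs(query).get(id_param)
    if values:
        # All views of the same document collapse to this one key.
        return values[0]
    # No ID found: fall back to matching on the full address.
    return url
```

With this, `story?id=42&view=print` and `story?view=normal&id=42` both map to the key `42`, so the second view gets skipped once the first has been stored.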

Update: A hash would probably be faster. With a DB, though, you could store the invalid links there and easily report on them later.
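The two ideas in the update can be combined: an in-memory hash for the "have I seen this?" test, and a DB only for the bad links. The sketch below assumes a Python set as the hash and an SQLite table named `broken`; the column names and the status value in the usage note are invented for illustration, and the actual fetching is left out.

```python
import sqlite3

seen = set()  # in-memory hash of page keys already checked

def should_check(key):
    """Return True once per key; later calls for the same key return False."""
    if key in seen:
        return False
    seen.add(key)
    return True

def make_report_db():
    # Hypothetical schema: which page held the bad link, and what went wrong.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE broken (page TEXT, link TEXT, status TEXT)")
    return conn

def record_broken(conn, page, link, status):
    conn.execute("INSERT INTO broken VALUES (?, ?, ?)", (page, link, status))

def broken_report(conn):
    """Return all recorded bad links, grouped by the page they appeared on."""
    rows = conn.execute("SELECT page, link, status FROM broken ORDER BY page, link")
    return rows.fetchall()
```

After the crawl, calling `record_broken(conn, "index.html", "gone.html", "404")` for each failure and then `broken_report(conn)` gives you the list to report on, while the lookups during the crawl never touch the DB.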
