agent00013 has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a program that verifies HTML links, then spiders through the site and checks the links on each page it turns up. I'm using HTML::LinkExtor and LWP::Simple to verify the links. My problem is implementing a spider that doesn't end up in an infinite loop. Any ideas on how to keep track of visited pages and avoid returning to them?

(Ovid) Re: Verifying Links in HTML
by Ovid (Cardinal) on Jun 13, 2001 at 23:51 UTC

    I would use a hash to store each visited link and use the exists function before visiting a link. The main issues would be:

    • You probably want to ensure that you have absolute links (not relative ones) in the hash.
    • Ensure that any links that point to scripts have the query string stripped. Rough guess as to the best way to do that:
      $link =~ s/\?.*$//;
      $visited{ $link } = '';

    Without knowing how you're getting into the infinite loop, I can only offer one other suggestion: for the %visited hash, you may have a problem if a redirection occurs as you'll resolve to the wrong URL. If that's likely to be an issue, you'll have to use LWP::UserAgent and examine the response code.
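
    Something along these lines might serve as a rough, untested sketch of what I mean (the start URL is just a placeholder, and it only follows plain <a href> links):

      use strict;
      use LWP::Simple qw(get);
      use HTML::LinkExtor;
      use URI;

      my %visited;
      my @queue = ('http://www.example.com/');    # starting page (placeholder)

      while ( my $url = shift @queue ) {
          next if exists $visited{$url};
          $visited{$url} = 1;

          my $html = get($url);
          next unless defined $html;              # fetch failed; report the broken link here

          # Pass the page URL as a base so relative links come back absolute
          my $parser = HTML::LinkExtor->new( undef, $url );
          $parser->parse($html);

          for my $link ( $parser->links ) {
              my ( $tag, %attr ) = @$link;
              next unless $tag eq 'a' and $attr{href};
              my $abs = URI->new( $attr{href} )->canonical;
              $abs->query(undef);                 # strip the query string
              $abs->fragment(undef);              # drop #anchors so the same page isn't queued twice
              push @queue, "$abs" unless exists $visited{"$abs"};
          }
      }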

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: Verifying Links in HTML
by Masem (Monsignor) on Jun 13, 2001 at 23:49 UTC
    Use a hash. If the link already exists in the hash, you can skip it. Otherwise process the link and add it to the hash.
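
    Inside your crawl loop, that's just something like this (a minimal sketch; %visited and process_link are placeholder names for whatever you're already using):

      next if $visited{$url}++;    # seen before: skip; otherwise mark it and fall through
      process_link($url);          # hypothetical routine that does the actual checking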


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
Re: Verifying Links in HTML
by perigeeV (Hermit) on Jun 14, 2001 at 02:19 UTC

    For what it's worth, linkcheck is available on CPAN. I love CPAN. CPAN makes me happy.

Re: Verifying Links in HTML
by Hero Zzyzzx (Curate) on Jun 13, 2001 at 23:50 UTC

    A general question gets a general answer...
    Are you going to be checking static or dynamic pages? This "infinite loop" thing is probably more of an issue for dynamic pages: there are often multiple ways of looking at the same page (normal, printable, lo-bandwidth, whatever), so you'd need to take that into account.

    If you're spidering static pages, what about dumping the address of each page into a DB and then checking each new page against the DB? If the page is already there, skip it.

    If they are dynamic pages, and each document has a unique id of some sort, you could parse out the ID and store that in the DB. Then the multiple views thing would be moot, because the ID is what you're matching on.

    Update: A hash would probably be better speed-wise, but with a DB you could also store the invalid links and easily report on them later.
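
    Combining those two points, something like this might do (untested; it assumes the document id arrives in a query parameter literally named "id", which is just a guess at your setup):

      use URI;

      my %seen_id;

      sub already_checked {
          my ($url)  = @_;
          my %params = URI->new($url)->query_form;
          my $key    = defined $params{id} ? $params{id} : $url;   # fall back to the raw URL
          return $seen_id{$key}++;    # false the first time, true on every repeat view
      }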

Re: Verifying Links in HTML
by agent00013 (Pilgrim) on Jun 14, 2001 at 00:07 UTC
    The pages I'm checking are static. I was attempting to use an array to keep track of visited links, but I was having problems testing whether a page was already in there; something was wrong with my regular expressions. I think the hash will work better, though. Thanks for your suggestions.

    For a string comprised of "http://www.whatever.html/directory/", how would I write a regular expression that'd make this equivalent to "http://www.whatever.html/directory/index.html"? Essentially I'd like to do this so that it doesn't test the same file twice under different names.

      $string = 'http://www.whatever.html/directory/';
      $string .= 'index.html' if substr($string, -1) eq '/';

      bbfu
      Seasons don't fear The Reaper.
      Nor do the wind, the sun, and the rain.
      We can be like they are.