in reply to Creating a web crawler (theory)

AM wrote: because webmasters don't always use FULL URLS like they should.

Certainly, there are reasons for using full URLs occasionally BUT WHERE DID YOU GET THAT IDEA? (That's not purely sarcasm. If you can offer an authority for that, I'd like to read it!)

IIRC, a full URL forces the visitor's browser to revisit the DNS server, creating needless traffic and slowing rendering. (See brian_d_foy's reply below re DNS revisits: he's right and I clearly IDidNotRC ... but I believe the balance of this post can stand!)

However, you have a number of good answers to your general question, and good suggestions for dealing with relative links.

But you may want to consider the volume of data you're apt to deal with. One of my sites has ~1600 pages and well over 5000 links. I can collect those links with a script -- ON A LOCAL MIRROR (i.e., no net time and no competition for the server's attention) -- in about 15 seconds, but I can't even guess what time it would take to chase down all the links on the secondary, tertiary, etc., pages...
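(For the curious, that collector is roughly this shape. A minimal sketch using HTML::LinkExtor and File::Find; the mirror path is made up:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use HTML::LinkExtor;

    my %links;    # unique URLs seen across every page

    # Walk the local mirror (path is hypothetical) and pull the
    # href out of every <a> tag in every HTML file.
    find( sub {
        return unless /\.html?$/;
        my $parser = HTML::LinkExtor->new( sub {
            my ( $tag, %attrs ) = @_;
            $links{ $attrs{href} }++ if $tag eq 'a' && $attrs{href};
        } );
        $parser->parse_file($_);
    }, '/var/mirror/mysite' );

    print "$_\n" for sort keys %links;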

Re^2: Creating a web crawler (theory)
by brian_d_foy (Abbot) on Jan 28, 2005 at 21:01 UTC

    A full URL forces the browser to revisit DNS? Where did you get that idea? Even if you have some wacky set-up where you aren't caching replies, it doesn't affect rendering. As for needless traffic, a DNS query isn't much compared to all those images we ask our browser to download.

    Relative URLs are a convenience for our typing. To follow a link, the browser still needs to make it an absolute URL, then go where that URL says. A relative URL in an HTML page is not a secret signal to the browser to use some sort of quick fetching algorithm.
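    You can watch that resolution happen with the URI module (a minimal sketch; the URLs are invented for illustration):

        use strict;
        use warnings;
        use URI;

        # What every browser does before it can fetch anything: resolve
        # the relative link against the base (current page) URL.
        my $base = 'http://www.example.com/docs/page.html';
        my $abs  = URI->new_abs( '../images/logo.png', $base );
        print "$abs\n";    # http://www.example.com/images/logo.png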

    You might be thinking of the difference between external and internal redirections. An external redirection is a full HTTP response that causes the user-agent to fetch the resource from a different URL. An internal redirection can be caught by the web server and handled without another request from the user-agent. Neither of these has anything to do with HTML, though.
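    You can see the external kind from the client side with LWP::UserAgent, which follows such redirections by default (a sketch; the URL is made up):

        use strict;
        use warnings;
        use LWP::UserAgent;

        my $ua       = LWP::UserAgent->new;   # follows external redirections by default
        my $response = $ua->get('http://www.example.com/old-page');

        # Walk back through the chain: each external redirection was a
        # complete HTTP response telling the user-agent to look elsewhere.
        for ( my $r = $response->previous; $r; $r = $r->previous ) {
            printf "%s said: go to %s\n", $r->request->uri, $r->header('Location');
        }
        print 'Landed at ', $response->request->uri, "\n";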

    --
    brian d foy <bdfoy@cpan.org>
      Relative URLs do more than save typing. They save retyping. If you move a project inside a site or just rename it, with relative paths you don't have to hunt down all the links and change them.

        You don't have to hunt down and re-type the links when you use a Perl script to do it for you. :)
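        Something along these lines, say (a sketch with HTML::TreeBuilder; the paths and filename are hypothetical):

            use strict;
            use warnings;
            use HTML::TreeBuilder;

            # Rewrite every link that still points at the old project location.
            my $tree = HTML::TreeBuilder->new_from_file('page.html');

            for my $link ( $tree->look_down( _tag => 'a', href => qr{^/old-project/} ) ) {
                my $href = $link->attr('href');
                $href =~ s{^/old-project/}{/new-project/};
                $link->attr( href => $href );
            }

            print $tree->as_HTML;
            $tree->delete;    # free the parse tree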

        --
        brian d foy <bdfoy@cpan.org>