And using regexes would probably be a pain because webmasters don't always use FULL URLS like they should.
Erm, no. Relative URLs are perfectly valid. Do you really think it'd be a good idea to have a hyooman explicitly add http://www.wherever.com/six/levels/deep/into/some/path/ to the front of every URI? Not every page is automatically generated.
At any rate, see the new_abs method from URI for how to handle these easily.
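For what it's worth, here's a rough sketch of resolving relative links with new_abs. The base URL, the link list, and the absolutize helper are all made up for illustration; in a real crawler the base would be the URL of the page you just fetched (or whatever its base element says).

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of URI): resolve one link against the
# URL of the page it was found on and return it as a plain string.
sub absolutize {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# Assumed example page; substitute the URL your crawler actually fetched.
my $base = 'http://www.example.com/six/levels/deep/index.html';

# Links as they might appear in href attributes on that page.
my @links = (
    'other.html',                       # relative to the current directory
    '../up.html',                       # one directory up
    '/top.html',                        # relative to the site root
    'http://elsewhere.example.org/',    # already absolute, left alone
);

print "$_ => ", absolutize($_, $base), "\n" for @links;
```

No regexes needed: the relative forms come back as full URLs on www.example.com, and the already-absolute one passes through unchanged.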
In reply to Re: Creating a web crawler (theory)
by Fletch
in thread Creating a web crawler (theory)
by Anonymous Monk