If you are seeking to create a link-checker, just search here for "Link check" and enjoy the many responses.
merlyn has written at least two articles on link checkers.
One way of doing it is to take an HTML page:
- Run it through HTML::LinkExtor to get the links (you may want to filter out image links and mailto: references). Don't forget to supply the base URL (server and base path) if the page uses relative links; see the module docs.
- Use either LWP::Simple or LWP::UserAgent to check those links. If you are just checking for "liveness", a HEAD request will suffice (and be kinder to your bandwidth). If you are spidering, you'd want to do a GET request.
- If you are spidering, you then do other steps such as adding new HTML pages to your queue of pages to check, watching the depth (how far from your original page you are), watching what server you're on so that you aren't trying to index the entire Internet, checking that you only index a given page once, respecting the rules given in robots.txt, etc.
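The first two steps can be sketched roughly as follows. This is a minimal illustration, not a finished tool: the sub names extract_links and check_link, the agent string, and the choice to keep only http/https links from <a> tags are all assumptions made for the example. It also falls back to GET when a server refuses HEAD, which is a judgment call.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;
use URI;

# Return the absolute http(s) URLs found in <a href="..."> tags of $html.
# $base is the URL the page came from; HTML::LinkExtor uses it to turn
# relative links into absolute ones.
sub extract_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my (%seen, @urls);
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};    # skip <img>, <link>, ...
        my $uri = URI->new($attr{href});
        my $scheme = $uri->scheme;
        next unless defined $scheme && $scheme =~ /^https?$/;   # skip mailto: etc.
        $uri->fragment(undef);                             # page#top is the same page
        push @urls, "$uri" unless $seen{"$uri"}++;
    }
    return @urls;
}

# Liveness check: HEAD first, GET only if the server refuses HEAD.
sub check_link {
    my ($ua, $url) = @_;
    my $res = $ua->head($url);
    $res = $ua->get($url) if $res->code == 405 || $res->code == 501;
    return ($res->is_success, $res->status_line);
}

# Usage: perl linkcheck.pl http://www.example.com/page.html
if (@ARGV) {
    my $page = shift @ARGV;
    my $ua   = LWP::UserAgent->new(timeout => 15, agent => 'linkcheck-sketch/0.1');
    my $res  = $ua->get($page);
    die "Cannot fetch $page: ", $res->status_line, "\n" unless $res->is_success;

    for my $url (extract_links($res->decoded_content, $res->base)) {
        my ($ok, $status) = check_link($ua, $url);
        printf "%-4s %s (%s)\n", ($ok ? 'OK' : 'DEAD'), $url, $status;
    }
}
```

Passing $res->base rather than the URL you typed means redirects and any <base> tag in the page are taken into account when relative links are resolved.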
By and large, if you just want a simple link-checker, go ahead and roll your own; it's a good, simple learning experience. If you are trying to spider more than a page or two, you should probably not reinvent the wheel, so start with someone else's work.
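To give a feel for the bookkeeping a spider needs (the queue, the depth limit, the same-server check and the "only once" rule from the list above), here is a rough sketch. The sub name crawl and the $get_links callback are hypothetical: the callback stands in for "fetch this page and return its absolute links", which you would implement with LWP and HTML::LinkExtor. robots.txt is not handled here; as far as I know, doing the fetching through LWP::RobotUA instead of LWP::UserAgent is one way to get that.

```perl
use strict;
use warnings;
use URI;

# Breadth-first crawl with a depth limit, a same-host check and a
# "seen" set. Returns the URLs in the order they were visited.
# $get_links is a hypothetical callback: URL in, list of absolute links out.
sub crawl {
    my ($start, $max_depth, $get_links) = @_;
    my $home  = URI->new($start)->host;
    my %seen  = ($start => 1);
    my @queue = ([$start, 0]);
    my @visited;

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        push @visited, $url;
        next if $depth >= $max_depth;              # don't expand past the limit

        for my $link ($get_links->($url)) {
            my $uri = URI->new($link);
            next unless ($uri->scheme // '') =~ /^https?$/;
            my $host = $uri->host;
            next unless defined $host && $host eq $home;   # stay on one server
            next if $seen{$link}++;                # index each page only once
            push @queue, [$link, $depth + 1];
        }
    }
    return @visited;
}
```

Keeping the fetching behind a callback means the queue logic can be tried out against a hash of fake pages before it is ever pointed at a real server.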
The lwpcook document (perldoc lwpcook) has some basics, but it's best to figure out what you are trying to do and then look up how to do it, just as you don't cook by reading the cookbook cover-to-cover.
Be sure to read the threads that turn up in the search; this is territory that's been well covered.