I wrote a script last year to check a database of around a thousand external links: simple stuff using DBI and LWP. Each week, the script checks these sites and mails the database maintainers about any problems it finds.
We decided to implement only a simple check at first, but we discussed ideas for future improvements, and experience has since suggested a few more:
- Distinguish between types of errors (DNS lookup, server error, page not found or removed, permanent redirection). Maybe re-test links with temporary failures after a few hours.
- Record in the database when the link last worked.
- Allow maintainers to flag links as not working, and instead of reporting failure for such links, report when they succeed. Users searching the database should not see such links in response to their queries.
- Use Net::Whois to detect changes in domain ownership and warn us in advance when a domain is about to expire. Certain unethical business people like to register newly expired domains and replace the content with things we don't want to link to.
- Just because a site returns an HTTP success code, that doesn't mean everything works fine. At present, maintainers check the links manually every now and again. We don't want to alert the maintainers every time a page changes, especially for dynamic content, but we might come up with a useful heuristic that checks whether certain key phrases are still present (or, for phrases like "page removed", still absent).
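As a rough sketch of the first and last ideas in the list above, the two subs below classify a response by status code and apply a key-phrase check to the page body. The names `classify_status` and `content_looks_ok`, the category labels, and the choice of which codes fall in which bucket are all made up for illustration; the only LWP-specific detail relied on is that LWP::UserAgent marks responses it generates itself (failed DNS lookup, refused connection, timeout) with a `Client-Warning: Internal response` header.

```perl
use strict;
use warnings;

# Hypothetical helper: map an HTTP status code (plus the value of LWP's
# Client-Warning header, if any) to a coarse category.
sub classify_status {
    my ($code, $client_warning) = @_;

    # Responses generated inside LWP itself never reached the server.
    return 'network'   if defined $client_warning
                       && $client_warning eq 'Internal response';
    return 'ok'        if $code >= 200 && $code < 300;
    return 'moved'     if $code == 301 || $code == 308;   # permanent redirect
    return 'redirect'  if $code >= 300 && $code < 400;    # temporary redirect
    return 'not_found' if $code == 404 || $code == 410;   # missing or removed
    return 'server'    if $code >= 500;
    return 'client';                                      # any other 4xx
}

# Hypothetical content check: every required phrase must appear and no
# forbidden phrase may appear (case-insensitive, literal match).
sub content_looks_ok {
    my ($body, $required, $forbidden) = @_;
    for my $phrase (@$required) {
        return 0 if $body !~ /\Q$phrase\E/i;
    }
    for my $phrase (@$forbidden) {
        return 0 if $body =~ /\Q$phrase\E/i;
    }
    return 1;
}

print classify_status(404, undef), "\n";
print content_looks_ok('Sorry, page removed', [], ['page removed']), "\n";
```

With a real user agent this might be called as `classify_status($res->code, $res->header('Client-Warning'))`. Note that LWP follows redirects by default, so the final response will rarely carry a 3xx code; the chain returned by `$res->redirects` would have to be inspected to spot a permanent redirect.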
On a separate project, I found XML::LibXML more convenient than HTML::Parser for screen scraping, because of its XPath querying method. Its parser can recover from badly formed XML and HTML, so the queries still work on real-world pages. I find XPath really useful for this kind of thing.
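A minimal sketch of that approach, assuming a reasonably recent XML::LibXML (one that provides `load_html`). The markup and the XPath expression are invented for the example; the markup is deliberately sloppy, with unclosed `<p>` and `<li>` tags and no `<html>` or `<body>`.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up, badly formed HTML for the demonstration.
my $html = <<'HTML';
<div class="links">
<p>Some links
<ul>
<li><a href="http://example.com/a">First</a>
<li><a href="http://example.com/b">Second</a>
</ul>
</div>
HTML

# recover => 2 asks the parser to recover from errors without warnings.
my $doc = XML::LibXML->load_html(string => $html, recover => 2);

# Collect the link text and target of every anchor inside the div.
my @links = map { { text => $_->textContent, href => $_->getAttribute('href') } }
            $doc->findnodes('//div[@class="links"]//a');

print "$_->{text} -> $_->{href}\n" for @links;
```

The same `findnodes` call works on a document fetched with LWP: pass the decoded response body as the `string` argument.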