I wrote a script last year to check a database of around a thousand external links: simple stuff using DBI and LWP. Each week, the script looks for problems with these sites and mails the database maintainers with any problems it encounters.
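The core loop is only a few lines. Here is a minimal sketch of the idea (in Python rather than the script's actual Perl, and assuming a hypothetical `links` table with `maintainer_email` and `url` columns — the real schema isn't shown in this article):

```python
# Sketch of a weekly link check. The original script uses Perl's DBI and
# LWP; this illustrates the same shape with sqlite3 and urllib.
import sqlite3
import urllib.error
import urllib.request

def check_link(url, timeout=10):
    """Return (ok, detail) for a single URL using a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return True, str(resp.status)
    except urllib.error.HTTPError as e:
        return False, f"HTTP {e.code}"           # server answered with an error
    except (urllib.error.URLError, OSError) as e:
        return False, str(e)                     # DNS failure, timeout, refused...

def find_broken(conn):
    """Yield (maintainer_email, url, detail) for every failing link."""
    for email, url in conn.execute("SELECT maintainer_email, url FROM links"):
        ok, detail = check_link(url)
        if not ok:
            yield email, url, detail
```

The weekly job would then group `find_broken`'s output by maintainer and mail each one their share of the failures.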
We decided to implement only a simple check initially, but we discussed possible future ideas, and we've since come up with more based on experience:
- Differentiate between different types of errors (DNS lookup, server error, page not found or removed, permanent redirection). Maybe re-test links with temporary failures after a few hours.
- Record in the database when the link last worked.
- Allow maintainers to flag links as not working, and instead of reporting failure for such links, report when they succeed. Users searching the database should not see such links in response to their queries.
- Use Net::Whois to detect changes in domain ownership and warn us in advance when a domain is about to expire. Certain unethical business people like to register newly expired domains and replace the content with things we don't want to link to.
- Just because a site returns an HTTP success code doesn't mean everything is working. At present, maintainers check the links manually every now and again. We don't want to alert the maintainers every time a page changes, especially for dynamic content, but we might come up with a useful heuristic that checks whether certain key phrases still exist (or, for phrases like "page removed", don't exist).
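The first idea above boils down to mapping each outcome to a category. A sketch of one possible mapping (Python, with illustrative names — not the categories the script actually uses):

```python
# Hypothetical failure classifier. `status` is the HTTP status code, or
# None when the request never completed (timeout, connection refused);
# `dns_failed` flags a hostname that no longer resolves.
def classify_failure(status, dns_failed=False):
    if dns_failed:
        return "dns"        # domain may have lapsed or changed hands
    if status is None:
        return "retry"      # temporary failure: re-test in a few hours
    if status == 301:
        return "moved"      # permanent redirect: update the stored URL
    if status in (404, 410):
        return "gone"       # page not found or explicitly removed
    if 500 <= status < 600:
        return "retry"      # server error: likely temporary
    return "ok" if 200 <= status < 300 else "other"
```

Only the "gone" and "moved" categories would go straight to the maintainers; "retry" links would be queued for a second pass before anyone is mailed.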
On a separate project, I found XML::LibXML more convenient than HTML::Parser for screen scraping because of its XPath querying, which works even on badly formed XML and HTML. I find XPath really useful for this kind of thing.
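To show the flavour of XPath-style querying, here is a tiny example (Python's stdlib ElementTree, which supports only a small XPath subset and, unlike libxml2's recovering parser behind XML::LibXML, needs well-formed input — the markup below is made up for illustration):

```python
# Pull every href attribute out of a page with one XPath-style query,
# instead of writing event-handler callbacks as with HTML::Parser.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <p>See <a href="http://example.org/">Example</a> and
     <a href="http://example.com/docs">Docs</a>.</p>
</body></html>
"""

root = ET.fromstring(page)
links = [a.get("href") for a in root.findall(".//a[@href]")]
```

With XML::LibXML the equivalent query would be `//a/@href`, and the parser will happily recover from the tag soup found on real sites.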