Code

LinkChecker comprises the linkchecker executable and linkcheck package.

linkcheck

Main package for link checking.

Running

linkchecker provides the command-line arguments and reads a list of URLs from standard input, reads configuration files, drops privileges if run as root, initialises the chosen logger and collects an optional password.

Uses linkcheck.director.get_aggregate() to obtain an aggregate object linkcheck.director.aggregator.Aggregate that includes linkcheck.cache.urlqueue.UrlQueue, linkcheck.plugins.PluginManager and linkcheck.cache.results.ResultCache objects.

Adds URLs in the form of url_data objects to the aggregate’s urlqueue with linkcheck.cmdline.aggregate_url() which uses linkcheck.checker.get_url_from() to return a url_data object that is an instance of one of the linkcheck.checker classes derived from linkcheck.checker.urlbase.UrlBase, according to the URL scheme.

linkcheck.checker classes

Optionally initialises profiling.

Starts the checking with linkcheck.director.check_urls(), passing the aggregate.

Finally it counts any errors and exits with the appropriate code.

Checking & Parsing

That is:

  • Checking a link is valid

  • Parsing the document the link points to for new links

linkcheck.director.check_urls() authenticates with a login form if one is configured via linkcheck.director.aggregator.Aggregate.visit_loginurl(), starts logging with linkcheck.director.aggregator.Aggregate.logger.start_log_output() and calls linkcheck.director.aggregator.Aggregate.start_threads() which instantiates a linkcheck.director.checker.Checker object with the urlqueue if there is at least one thread configured, else it calls linkcheck.director.checker.check_urls() which loops through the entries in the urlqueue.

Either way linkcheck.director.checker.check_url() tests to see if url_data already has a result and whether the cache already has a result for that key. If not it calls url_data.check(), which calls url_data.check_content() that runs content plugins and returns do_parse according to url_data.do_check_content and linkcheck.checker.urlbase.UrlBase.allows_recursion() which includes linkcheck.checker.urlbase.UrlBase.allows_simple_recursion() that is monitoring the recursion level (with linkcheck.checker.urlbase.UrlBase.recursion_level). If do_parse is True, passes the url_data object to linkcheck.parser.parse_url() to call a linkcheck.parser.parse_ method according to the document type e.g. linkcheck.parser.parse_html() for HTML which calls linkcheck.htmlutil.linkparse.find_links() passing url_data.get_soup() and url_data.add_url. url_data.add_url puts the new url_data object on the urlqueue.