Code
LinkChecker comprises the linkchecker executable and linkcheck package.
Main package for link checking. |
Running
linkchecker provides the command-line arguments and reads a list of URLs from standard input, reads configuration files, drops privileges if run as root, initialises the chosen logger and collects an optional password.
Uses linkcheck.director.get_aggregate() to obtain an aggregate object
linkcheck.director.aggregator.Aggregate
that includes linkcheck.cache.urlqueue.UrlQueue,
linkcheck.plugins.PluginManager and
linkcheck.cache.results.ResultCache objects.
Adds URLs in the form of url_data objects to the aggregate’s urlqueue with
linkcheck.cmdline.aggregate_url() which uses
linkcheck.checker.get_url_from() to return a url_data object that is an instance
of one of the linkcheck.checker classes derived from linkcheck.checker.urlbase.UrlBase,
according to the URL scheme.
Optionally initialises profiling.
Starts the checking with linkcheck.director.check_urls(), passing the aggregate.
Finally it counts any errors and exits with the appropriate code.
Checking & Parsing
That is:
Checking a link is valid
Parsing the document the link points to for new links
linkcheck.director.check_urls() authenticates with a login form if one is configured
via linkcheck.director.aggregator.Aggregate.visit_loginurl(), starts logging
with linkcheck.director.aggregator.Aggregate.logger.start_log_output()
and calls linkcheck.director.aggregator.Aggregate.start_threads() which instantiates a
linkcheck.director.checker.Checker object with the urlqueue if there is at
least one thread configured, else it calls
linkcheck.director.checker.check_urls() which loops through the entries in the urlqueue.
Either way linkcheck.director.checker.check_url() tests to see if url_data already has a result and
whether the cache already has a result for that key.
If not it calls url_data.check(),
which calls url_data.check_content() that runs content plugins and returns do_parse
according to url_data.do_check_content and linkcheck.checker.urlbase.UrlBase.allows_recursion() which
includes linkcheck.checker.urlbase.UrlBase.allows_simple_recursion() that is monitoring the recursion level
(with linkcheck.checker.urlbase.UrlBase.recursion_level).
If do_parse is True, passes the url_data object to linkcheck.parser.parse_url() to call a
linkcheck.parser.parse_ method according to the document type
e.g. linkcheck.parser.parse_html() for HTML which calls linkcheck.htmlutil.linkparse.find_links()
passing url_data.get_soup() and url_data.add_url.
url_data.add_url puts the new url_data object on the urlqueue.