Code
LinkChecker comprises the linkchecker executable and linkcheck package.
Main package for link checking. |
Running
linkchecker provides the command-line arguments and reads a list of URLs from standard input, reads configuration files, drops privileges if run as root, initialises the chosen logger and collects an optional password.
Uses linkcheck.director.get_aggregate()
to obtain an aggregate object
linkcheck.director.aggregator.Aggregate
that includes linkcheck.cache.urlqueue.UrlQueue
,
linkcheck.plugins.PluginManager
and
linkcheck.cache.results.ResultCache
objects.
Adds URLs in the form of url_data objects to the aggregate’s urlqueue with
linkcheck.cmdline.aggregate_url()
which uses
linkcheck.checker.get_url_from()
to return a url_data object that is an instance
of one of the linkcheck.checker
classes derived from linkcheck.checker.urlbase.UrlBase
,
according to the URL scheme.
Optionally initialises profiling.
Starts the checking with linkcheck.director.check_urls()
, passing the aggregate.
Finally it counts any errors and exits with the appropriate code.
Checking & Parsing
That is:
Checking a link is valid
Parsing the document the link points to for new links
linkcheck.director.check_urls()
authenticates with a login form if one is configured
via linkcheck.director.aggregator.Aggregate.visit_loginurl()
, starts logging
with linkcheck.director.aggregator.Aggregate.logger.start_log_output()
and calls linkcheck.director.aggregator.Aggregate.start_threads()
which instantiates a
linkcheck.director.checker.Checker
object with the urlqueue if there is at
least one thread configured, else it calls
linkcheck.director.checker.check_urls()
which loops through the entries in the urlqueue.
Either way linkcheck.director.checker.check_url()
tests to see if url_data already has a result and
whether the cache already has a result for that key.
If not it calls url_data.check(),
which calls url_data.check_content() that runs content plugins and returns do_parse
according to url_data.do_check_content and linkcheck.checker.urlbase.UrlBase.allows_recursion()
which
includes linkcheck.checker.urlbase.UrlBase.allows_simple_recursion()
that is monitoring the recursion level
(with linkcheck.checker.urlbase.UrlBase.recursion_level
).
If do_parse is True, passes the url_data object to linkcheck.parser.parse_url()
to call a
linkcheck.parser.parse_ method according to the document type
e.g. linkcheck.parser.parse_html()
for HTML which calls linkcheck.htmlutil.linkparse.find_links()
passing url_data.get_soup() and url_data.add_url.
url_data.add_url puts the new url_data object on the urlqueue.