linkcheck.checker.httpurl
Handle http links.
Classes

HttpUrl | Url link with http scheme.
- class linkcheck.checker.httpurl.HttpUrl(base_url, recursion_level, aggregate, parent_url=None, base_ref=None, line=-1, column=-1, page=-1, name='', url_encoding=None, extern=None)[source]
Bases: InternPatternUrl
Url link with http scheme.
Initialize check data, and store given variables.
- Parameters:
base_url – unquoted and possibly unnormed url
recursion_level – on what check level lies the base url
aggregate – aggregate instance
parent_url – quoted and normed url of parent or None
base_ref – quoted and normed url of <base href=""> or None
line – line number of url in parent content
column – column number of url in parent content
page – page number of url in parent content
name – name of url or empty
url_encoding – encoding of URL or None
extern – None or (is_extern, is_strict)
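HttpUrl instances are usually not constructed by hand; here is a hedged sketch of obtaining one through the get_url_from() factory in linkcheck.checker, which selects HttpUrl for the http scheme. The configuration and director calls below reflect recent LinkChecker releases and may differ between versions::

    import linkcheck.checker
    import linkcheck.configuration
    import linkcheck.director

    # Build a default configuration and the aggregate driving a check run.
    config = linkcheck.configuration.Configuration()
    config.sanitize()
    aggregate = linkcheck.director.get_aggregate(config)

    # The factory picks HttpUrl for http URLs.
    url_data = linkcheck.checker.get_url_from(
        "http://example.com/",  # base_url: unquoted, possibly unnormed
        0,                      # recursion_level: 0 for a root URL
        aggregate,
    )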
- allows_robots(url)[source]
Fetch and parse the robots.txt of the given url. Checks whether LinkChecker is allowed to get the requested resource content.
- Parameters:
url (string) – the url to be requested
- Returns:
True if access is granted, otherwise False
- Return type:
bool
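For illustration, the same gate can be expressed with the standard library's urllib.robotparser; this sketches the concept, not LinkChecker's internal code::

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()  # fetch and parse robots.txt

    # True if a crawler identifying itself as "LinkChecker" may fetch
    # the requested resource, otherwise False.
    print(rp.can_fetch("LinkChecker", "http://example.com/private/page.html"))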
- check_connection()[source]
Check a URL with the HTTP protocol. Here is an excerpt from RFC 1945 describing the common response code classes: the first digit of the Status-Code defines the class of the response; the last two digits do not have any categorization role. There are five values for the first digit:
1xx: Informational - Not used, but reserved for future use
2xx: Success - The action was successfully received, understood, and accepted
3xx: Redirection - Further action must be taken in order to complete the request
4xx: Client Error - The request contains bad syntax or cannot be fulfilled
5xx: Server Error - The server failed to fulfill an apparently valid request
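A minimal sketch of this categorization (the mapping mirrors the RFC 1945 excerpt above, not LinkChecker's actual response handling)::

    STATUS_CLASSES = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }

    def classify(status_code):
        """Return the RFC 1945 response class for a status code."""
        return STATUS_CLASSES.get(status_code // 100, "Unknown")

    assert classify(200) == "Success"
    assert classify(301) == "Redirection"
    assert classify(404) == "Client Error"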
- construct_auth()[source]
Construct HTTP Basic authentication credentials if user/password information is available. Does not overwrite credentials that have already been constructed.
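Since get_request_kwargs() below targets requests.Session, HTTP Basic credentials amount to a (user, password) tuple on the session; a hedged sketch with hypothetical credentials::

    import requests

    session = requests.Session()
    if session.auth is None:  # do not overwrite existing credentials
        session.auth = ("user", "password")  # hypothetical values

    # Subsequent requests on this session send an Authorization header.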
- content_allows_robots()[source]
Return False if the content of this URL forbids robots from searching it for recursive links.
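For illustration, a regex-based sketch of such a meta robots check (an assumption for clarity, not LinkChecker's actual HTML parsing)::

    import re

    META_ROBOTS = re.compile(
        r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']*)["\']',
        re.IGNORECASE,
    )

    def content_allows_robots(html):
        """Return False if a robots meta tag contains 'nofollow'."""
        match = META_ROBOTS.search(html)
        return not (match and "nofollow" in match.group(1).lower())

    assert content_allows_robots("<html><body>ok</body></html>")
    assert not content_allows_robots(
        '<meta name="robots" content="noindex, nofollow">'
    )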
- get_request_kwargs()[source]
Construct keyword parameters for Session.request() and Session.resolve_redirects().
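Session.resolve_redirects() identifies the session as a requests.Session; a hedged sketch of the kind of dictionary this method assembles (the exact keys LinkChecker sets may differ, but all of these are valid requests parameters)::

    def get_request_kwargs(timeout, verify_ssl=True, proxies=None):
        kwargs = {
            "timeout": timeout,        # connect/read timeout in seconds
            "allow_redirects": False,  # redirects are resolved explicitly
            "verify": verify_ssl,      # SSL certificate verification
        }
        if proxies:
            kwargs["proxies"] = proxies  # e.g. {"http": "http://proxy:3128"}
        return kwargs

    # Usage: response = session.request("GET", url, **get_request_kwargs(60))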
- get_robots_txt_url()[source]
Get the corresponding robots.txt URL for this URL.
- Returns:
robots.txt URL
- Return type:
string
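By convention robots.txt lives at the root of the URL's host; an illustrative sketch with the standard library (not LinkChecker's internal implementation)::

    from urllib.parse import urlsplit

    def get_robots_txt_url(url):
        """Return the robots.txt URL for the host of the given URL."""
        parts = urlsplit(url)
        return "%s://%s/robots.txt" % (parts.scheme, parts.netloc)

    assert (get_robots_txt_url("http://example.com/a/b.html?q=1")
            == "http://example.com/robots.txt")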